In this episode, I dive into a critical issue plaguing many organizations: the management of data and knowledge in statistical analysis.

Drawing from recurring observations, I highlight a common scenario where data generated from analyses, ranging from patient-level data to study results like means, sample sizes, proportions, confidence intervals, and p-values, are stored in various formats such as tables, figures, and listings. However, the manner in which this knowledge is stored often proves inefficient and ineffective.

I discuss the prevalent practice of storing tables in disparate files, leading to challenges in accessibility and knowledge retention. 

I also talk about the following key points:
  • Inefficient data storage
  • Accessibility challenges
  • Knowledge retention issues
  • Impact on regulatory inquiries
  • Manual recreation of analyses
  • Solutions for optimization


Does Your Company Know What It Knows?

[00:00:00] Alexander: Welcome to a new episode of The Effective Statistician. Today I want to talk about some observations that I make again and again and again. And this is about how we manage all the data that we have. And here I’m not talking about, you know, the patient-level data, the ADaM datasets, the CDISC datasets, all these kinds of different things. I’m talking about all the results that we have. The means, the sample sizes, the proportions, the confidence intervals, the p values, all of these different things.

[00:00:53] What I see, again and again, is the following. [00:01:00] We have an SAP or some kind of specification for our analysis, then we program this analysis, and then we store the knowledge that we have generated as tables, figures, and listings. TFLs, or whatever kind of acronym you use in your organization. Mostly tables, actually.

[00:01:28] When I think about TFLs, well, mostly people talk about tables. Less about figures, and maybe people still use listings as well. But mostly tables. How do we store these tables? Well, usually these are RTF files, PDF files, Word documents, all these kinds of typical things. And we put them into a safe space, [00:02:00] a production area, somewhere in our IT system.

[00:02:05] And of course, this is a well-managed system, hopefully it is, and only special people have access to this. Very often programmers, statisticians, maybe medical writing. That’s about it. And how do we store them there? Well, I’ve seen all kinds of different things. Or heard about all kinds of different things.

[00:02:32] What I very often hear about is there is one document per table. Or one file per table. And so we have many, many different files. And if you’re kind of well organized, maybe there’s some kind of system for how you name these files. But I’ve also seen that people just number them: 4, [00:03:00] 1635.

[00:03:03] That’s where we store all our knowledge. But in that form, of course, it is of very limited use, because only very few people have access to it. And all the, you know, knowledge around it, the narratives, all that is missing. So we transfer these tables into documents, into a CSR, into slide sets that we show, into posters, papers, dossiers, whatever, you know.

[00:03:42] And then we send them to the regulators that we work with, to the conferences, and so on. And that’s where very often things stop from an access point of view for statisticians. [00:04:00] What happens thereafter, I don’t want to talk about today. That’s for another episode. So if this is what the overall process looks like in your company, or if you’re working in a CRO and you work exactly like this, you have a problem.

[00:04:21] A pretty big one. Because if you work like this, your company does not know what it knows. What do I mean by this? A company is a really, really big organization. And it is a fluid organization. There are always things that are changing. You have new staff coming on board. You have people that leave. You have people that have [00:05:00] so much on their plate that they just can’t handle another thing.

[00:05:05] And if these are the people that have an overview of all the different results, you have a problem. Because your company, as an organization, can only leverage all the knowledge that you have generated, the means, the p values, the confidence intervals, the odds ratios, and all these kinds of different things, if it knows about it.

[00:05:36] So first, an organization needs to know that this data, this evidence, actually exists. If you think, Well, where’s the problem?, I want to tell a couple of different stories. So the first story, it goes a little bit back into [00:06:00] history. And there were a couple of these neuroscience drugs. Lots of big companies were working on them.

[00:06:08] Yeah. If you look into the top companies with some CNS history, all of them had these CNS drugs. And now, there was a side effect that was emerging. And the FDA asked these different companies, would you please provide all the evidence that you have about this side effect? And it was a really important side effect.

[00:06:38] And then the companies had a problem.

[00:06:42] They were starting to look for all the different studies that they had around these drugs. Yes, I’m not even talking about the analyses. I’m talking about the studies [00:07:00] themselves. I know that in several of these big organizations, there were study hunting teams, like, you know, people that were completely dedicated to exploring where in these big organizations all these different studies were.

[00:07:23] And there were several waves that went through these companies where different teams were hunting for these studies. And with every wave, they found new studies. Because at the time, a lot of these studies were run through the different affiliates. The Italian affiliate, the Portuguese affiliate, the Korean affiliate, the Canadian affiliate, the affiliate in Brazil, the affiliate in Australia, and so on.

[00:07:56] Dozens of different affiliates running [00:08:00] their own studies. And of course, how do affiliates run studies? Well, very often they outsource it. Ah, now you need to find all the different vendors where these different studies were run. Hmm, do these vendors still exist? Where have they stored their data? There are stories about these hunting teams finding piles of paper CRFs somewhere.

[00:08:32] Or CDs with encrypted data, but nobody knew the password anymore. And so they had a study, but no data to go with it, because they couldn’t break the encryption.

[00:08:49] Now if you think, well, now everything is organized much better in our global company, you know, we [00:09:00] have an overview of all the different studies and we have all the data in house. We have all the, you know, CDISC data in house, the ADaM datasets in house, and yes, we also store all the tables in house.

[00:09:16] Yeah, maybe it’s better now in that at least you know how many studies, which studies you ran. But how much better is it? What happens if someone comes to you and says, Hey, have we analyzed this subgroup?

[00:09:38] And you may think, well, let’s have a look into the SAP. Ah, hmm. There’s not just one SAP. There are multiple SAPs. There are actually multiple studies. And for these different studies, we have different database locks, and for the [00:10:00] different database locks, we have the different CSRs, but we also have many different publications.

[00:10:08] Ah, yeah. And then there are also these HTA analyses. Ah, and then there’s this real world evidence team that is also working on analyses and indirect comparisons. Yeah. We have done these as well.

[00:10:22] It is really, really hard to understand whether a specific analysis was already done. Does it already exist?

[00:10:35] Another question. Have we analyzed this specific item or subscale of a questionnaire?

[00:10:45] When I ask these kinds of questions in big organizations, they don’t know. They can’t say for sure, well, yes, we have done that.

[00:10:56] Or, it really depends on just one [00:11:00] person. There’s this super brain, yeah, who has worked forever on this compound and who has a very, very good overview of all the different things that were done. Oh, welcome. Really, really great. Then you have one person, and if that person moves to another project, leaves the company, is sick, whatever, you have a problem.

[00:11:26] As an organization, you then don’t know whether you have done certain analyses. And of course, that leads to all kinds of different problems.

[00:11:39] When regulators come to you, when oversight comes to you and asks you about these kinds of things, you don’t have an answer readily available. [00:11:52] Even if you know, yes, we have done subgroup analyses, can you [00:12:00] easily find all of them?

[00:12:01] Can you easily say, well, yes, we have done the subgroup analysis for these different endpoints, for these different studies, at these different database locks? And we also always used this definition for the subgroup? I’m not talking about, you know, gender. I’m talking about pretreatment or things like that. It’s very often much fuzzier what pretreatment is. Then can you locate all these analyses? And can you make sure that you find all of them? Not the most recent one, or the ones from the studies that you know about, but all of them.

[00:12:48] And in a reasonable time. I’m not talking here about setting up a specific team or, you know, giving one person two weeks of work to [00:13:00] make sure that they go through all the different datasets, all the different tables.

[00:13:07] Do I have access to all of these? How often do you run into the situation where you say, Oh, yeah, you will find the analysis here? And then the other person comes back and says, Can you give me access? And then, you know, all of that. How fast does that work? Do you have it readily available?

[00:13:34] This is a big, big problem in our organizations. Our organizations don’t know what they know. This is what I mean by this.

[00:13:48] And it gets more and more difficult. Because how often do you have a situation where for a specific [00:14:00] compound, or let’s say just for an indication to make it easier, you have one study, you have one CSR, you have one paper, and you have, let’s say, three, four posters. How often do you have that for a specific compound?

[00:14:17] I’m not talking about, you know, the compounds that just have entered phase one.

[00:14:23] I know much better the situation where you have many studies, you have various CSRs, clinical study reports, with multiple iterations, because the study runs over several years, for example, and then you have a report after, let’s say, three months, you have another report after a year, you have another report after two years, and so on.

[00:14:50] You have, very often, hundreds of posters. I’ve seen publication plans with hundreds [00:15:00] of posters. Yes, a lot of them. Dozens of papers. And I’m not talking about, you know, just the papers from your team, but also the papers from all the other teams. You know, the people that work on real world evidence, the people that do evidence synthesis in terms of indirect comparisons, and then all the additional HTA analyses that are needed.

[00:15:30] And if you’re not working in that space, yeah, a lot of additional analyses will be needed for the different HTA submissions. Just speak to your German colleagues and you will understand how much of that is needed.

[00:15:47] This is surely a very, very prevalent situation. Many studies, lots of different CSRs, and over time you [00:16:00] build these knowledge databases: thousands and tens of thousands of tables. If you’re managing your results, your means, your p values and so on, that way, you are overburdening your organization. Because every time someone in medical writing, someone in clinical, someone in medical, someone in regulatory and so on needs to understand something and can’t find it in the documents that they have, the clinical study reports or whatever, they come to your statisticians and ask, can you please check whether we have done this analysis, where it is, and can you please give me access to it or send it to me?

[00:17:00] And then your statisticians will spend time, a lot of time, doing all these tasks. That’s just a waste of time. This is a really, really big waste. And it’s also pretty demotivating, yeah, to become the table finder, the table provider. Is this how you define your contribution, really? It is also really clunky then to reuse any of these results.

[00:17:39] If you then, you know, want to use it not just in your CSR, but in your publication or whatever, you always need to manually transfer all these kinds of different data into something new. And yes, that happens all the time. [00:18:00] People want to create, let’s say, a figure with the different proportions coming from 5, 6, 7 different tables.

[00:18:12] Well, then they look through all these different tables, copy and paste or manually transfer that into an Excel spreadsheet and from there create a new figure. Wow! What a great process in our modern day world. And this happens again and again and again and again.

[00:18:36] What happens if there’s a typo? Well, you surely have some kind of peer review process that makes sure that all these manual things are created really, really well. I can’t believe it. We spend so much time and so many SOPs on making sure that our [00:19:00] tables are correct, just to then hand them over, and then we have all these manual processes.

[00:19:08] I can’t believe it! Because what people in the end really look into is very often not the CSRs. They look into the publications, the slide sets that are presented to upper management, all these kinds of other areas. And this overall chain of creating results is only as strong as its weakest link. And this weakest link is there at the end.

[00:19:37] So we spend a lot of time on making things right, just to mess it up at the end. Another problem is that very, very often you have something missing in your tables. Let’s say you want to present the results and in your [00:20:00] table you have the proportions. And now you also want to show the risk difference.

[00:20:05] Hmm, no, we have not included that in our table. Hmm, what do we do now? So, version A, we calculate it by hand. And yes, that happens all the time. Especially if people want to create some kind of bar chart whatsoever out of it.

[00:20:24] Or, we analyze the data again and create a new table that now also includes the risk difference. Ah, no, the advisors ask for confidence intervals around these risk differences. Ah, we haven’t thought about that. Ah, so we need to redo the analysis and add the confidence intervals.

[00:20:48] Why is this journal asking for p values when we already have confidence intervals? Oh, okay. Let’s do another analysis. [00:21:00] And of course, each time, we need to check that the new analysis matches the old analysis. And each time we write new specifications, we go through the whole process, we document everything, we have the updates, and all these kinds of different things. Just to add some new summary statistics.

[00:21:24] And yes, that happens all the time. Just talk to, you know, the people that work in medical affairs about how often they have these kinds of problems.
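The "version A" by-hand calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any particular SAP: it assumes the event counts and sample sizes behind the proportions are available (not just rounded percentages from a table), and it uses the simple Wald interval, which is one of several possible methods. The function name `risk_difference` and the example counts are made up.

```python
import math

def risk_difference(events_1, n_1, events_2, n_2, z=1.959963984540054):
    """Risk difference (group 1 minus group 2) with a Wald confidence interval.

    Assumption: z defaults to the standard normal quantile for a
    two-sided 95% interval. Other interval methods exist and an SAP
    may require one of them instead.
    """
    p_1 = events_1 / n_1
    p_2 = events_2 / n_2
    rd = p_1 - p_2
    # Standard error of the difference of two independent proportions
    se = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    return rd, rd - z * se, rd + z * se

# Hypothetical example: 30/100 responders versus 20/100 responders
rd, lower, upper = risk_difference(30, 100, 20, 100)
print(f"RD = {rd:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

The point of the sketch is that the calculation itself is trivial once the underlying numbers are stored in a reusable form; the effort described in the episode comes from having only formatted tables to start from.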

[00:21:38] It’s even about the rounding. I know so many people that are so frustrated with one of these high-ranked journals that absolutely requires you to have p values with four digits instead of three digits. Now [00:22:00] welcome if all your tables have rounding to three digits. And now, your team has decided to submit your data to this journal.

[00:22:14] Ah, welcome, you need to rerun all the different analyses and report four digits. Yes, yes, do you think that’s silly? Yes, but it happens again and again and again. All these different things happen again and again and again, and it puts a huge burden on the statistics function. It makes the overall process really, really clunky. If you’re in that situation and you really want to improve it, then listen to the next episode, where we’ll talk more about what a solution could look like.
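The rounding example from the transcript becomes a pure formatting step if the unrounded p value is kept alongside the formatted table. A minimal sketch, assuming such unrounded values are stored somewhere; `format_p` is a hypothetical helper and the numbers are invented for illustration.

```python
def format_p(p, digits=3):
    """Format a p value with a given number of decimal digits.

    Values below the smallest displayable number are shown as
    "<0.001" (or "<0.0001" for four digits) instead of "0.000".
    """
    threshold = 10 ** -digits
    if p < threshold:
        return f"<{threshold:.{digits}f}"
    return f"{p:.{digits}f}"

# Hypothetical unrounded p values from a stored result
for p in (0.04567, 0.00004):
    print(format_p(p, digits=3), format_p(p, digits=4))
```

With only the three-digit value in an RTF or PDF table, the fourth digit cannot be recovered, which is why the analysis has to be rerun in the situation described above.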

Join The Effective Statistician LinkedIn group

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to be able to understand the evidence.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.