Latent Dirichlet allocation

[Figure: Latent Dirichlet allocation diagram, from Wikipedia]

High-quality clinical trial data serves as the foundation for the analysis, submission, approval, labeling and marketing of a compound under study. Data cleaning, used widely throughout the industry, ensures that the process deployed to collect data is consistent and accurate.

Challenges in collection include data errors during manual data entry, e.g., spelling and transcription mishaps, and range and text errors, which impact coding. Automated edit checks can prevent the entry of inaccurate information, but they cannot detect all potential data entry issues.

Numerous manually generated queries put pressure on time and cost. Applying AI (artificial intelligence) techniques to understand the context of these queries may improve automated edit checks and offer opportunities to add checks or processes to identify issues earlier in the studies. Additionally, applying machine learning (ML) to historic manual queries across different studies can improve the understanding of common issues across and within studies, bringing an even more targeted approach to process optimization for data cleaning.

Reducing manual queries via AI

Several clinical trial scenarios were reviewed to determine how AI approaches could be applied.

In Study 1, a higher-than-expected number of queries was identified: of 21,103 queries, 7,560 (36%) were manual. Considering that the average cost of a manual query from start to close is about $200, the high percentage of manual queries added significant cost to the study, roughly $1.5 million in this case.

Looking at the specifics of the manual queries, the data included the form, variable and row each query was raised on, together with the query message. This information provided the basis for further study, with the aim of reducing the number of manual queries. However, there were questions to be researched, including whether themes could be identified in the manual queries without subjecting them to human bias.

The main challenge was that query messages were free text; a query with the same meaning could be written in different ways, and this required an approach that could extract information from these queries. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the topics or themes that occur in a set of documents, or in this case queries, and it was found to be a good approach for this particular challenge.

Deploying an LDA modeling algorithm

Latent Dirichlet allocation (LDA) is a topic modeling technique and a form of unsupervised learning that can be used to model documents together with their corresponding topics and terms. The goal is to extract semantic components from the lexical structure of a document. Applied properly, LDA offered an opportunity to identify the themes in the manual queries. It also reduces the number of data features to a manageable set before any classification step, and it learns each document's topic distribution across a collection of documents. This matters because, looking across a set of documents, the same common words may appear in many of them but in different proportions, which makes it challenging to identify and define the appropriate category for each document by inspection alone; a topic distribution captures those proportions directly.
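To make this concrete, here is a minimal sketch of LDA in Python using scikit-learn. The example query texts and the number of topics are illustrative assumptions, not values from the study; the point is that each fitted row is one query's topic distribution.

```python
# Minimal LDA sketch with scikit-learn. The query texts and n_components
# are illustrative assumptions, not values from the study.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "adverse event start date is after the end date",
    "adverse event end date is missing",
    "lab value out of expected range for this visit",
    "lab result recorded against the wrong visit date",
]

# Convert each query message into a bag-of-words count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(queries)

# Fit LDA with a small, illustrative number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# Each row is one query's topic distribution (the weights sum to 1),
# so queries sharing a theme end up with similar distributions.
for query, dist in zip(queries, doc_topics):
    print(dist.round(2), query)
```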

Via LDA, the mission was to determine whether the number of manual queries could be reduced by understanding problematic forms. For example, could this technology result in more focused edit checks, and could queries be auto-generated?

First, each query was tokenized, or split into words, with common data management words removed, e.g., confirm, verify, check and please, and LDA was applied to the queries to create a number of common topics for further review. Then, the results of the LDA were visualized in the context of the forms used to collect the data, and study experts were consulted to bring a deeper understanding of the context.
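A sketch of that preprocessing step, assuming a scikit-learn pipeline, appears below: a custom stop-word list strips the common data management words before LDA is fitted, and each topic's top terms are listed for review. The stop words, topic count and input texts are all illustrative.

```python
# Sketch of the preprocessing described above: drop common data
# management words, fit LDA, and list each topic's highest-weighted
# terms for expert review. Stop words, topic count and inputs are
# illustrative placeholders, not the study's actual values.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Placeholder for the study's free-text manual query messages.
query_messages = [
    "please confirm the adverse event start date",
    "verify the medication end date is correct",
    "check lab value, result appears out of range",
]

# Generic English stop words plus domain words with no topic signal.
stop_words = list(ENGLISH_STOP_WORDS) + ["confirm", "verify", "check", "please"]

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(query_messages)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# Print each topic's top terms, the raw material for expert review.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")
```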

Study results demonstrate efficacy of AI approach

The visualizations provided the study experts with insights into the different topics, identifying the most common words in each topic together with a summary of the context, such as the form and variable that the queries within a topic were raised against. This enabled the team to understand what was driving queries within a topic, providing the basis to explore how these queries might be reduced, for example through enhanced edit checks or through a rules-based approach that speeds up the discovery of potential data issues.
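As a sketch of that kind of context summary, assuming a pandas DataFrame with one row per manual query (the topic assignment from the fitted LDA plus the form and variable, all names invented here), a simple cross-tabulation shows which forms drive each topic:

```python
# Sketch of summarizing topics against collection context. The topic
# assignments, form and variable names are illustrative placeholders.
import pandas as pd

# One row per manual query: its dominant LDA topic plus the form and
# variable the query was raised against.
df = pd.DataFrame({
    "topic":    [0, 0, 1, 1, 1],
    "form":     ["AE", "AE", "LAB", "LAB", "AE"],
    "variable": ["AESTDAT", "AEENDAT", "LBORRES", "LBDAT", "AEENDAT"],
})

# Count, per topic, which forms its queries were raised against;
# this is the sort of view the study experts reviewed.
print(df.groupby(["topic", "form"]).size().unstack(fill_value=0))
```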

A rules-based approach to further the research

Based on the efficacy of the AI approach and the interpretation of the different topics, the next step was to investigate whether generated rules could identify some of the queries in the 'top' topics. This would be a critical foundation for applying the approach to more studies, where it would uncover comparative overlap and differences between studies.

A rules-based scenario required the creation of a dataset to assess the impact of the implemented rules, using a snapshot from Study 1. For example, the team could work with a topic result and an expert to generate rules around the AE (adverse event) and sick day medications, then apply those rules to the data snapshot to understand how many manual queries aligned with the results, as in the sketch below.
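As a hypothetical illustration, assuming a pandas snapshot with AE start and end date columns (names invented here), one such rule might flag records where the end date precedes the start date; matching the flagged records against historical manual queries then estimates how many of those queries the rule would replace.

```python
# Hypothetical rule sketch: flag AE records whose end date precedes the
# start date. Column names and data are invented for illustration.
import pandas as pd

ae = pd.DataFrame({
    "subject": ["001", "002", "003"],
    "AESTDAT": pd.to_datetime(["2021-03-01", "2021-04-10", "2021-05-02"]),
    "AEENDAT": pd.to_datetime(["2021-02-20", "2021-04-15", "2021-04-30"]),
})

# Rule: the AE end date must not fall before the AE start date.
flagged = ae[ae["AEENDAT"] < ae["AESTDAT"]]
print(f"{len(flagged)} record(s) would trigger an automated query")

# Matching flagged records against the snapshot's historical manual
# queries estimates how many of them this rule would have replaced.
```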

Study 2 was an extension study building on Study 1, supplemented by results from an earlier study in the same clinical indication that helped in understanding overlap and differences. As expected, there was a large amount of overlap; however, the differences addressed in Study 1 were not the same as those in Study 2.

Study 3 proved interesting. It was a completely different phase, study, sponsor, indication and population, yet it also revealed some differences and some overlap.

Potential of the AI approach as part of the general data management process

Finally, the potential of this approach as part of the general data management process was explored from the perspective of added value and efficiency gains.

Via LDA, it was determined that the number of manual queries could indeed be reduced by understanding problematic forms. Additionally, this technology can result in more focused edit checks and allow queries to be auto-generated.

The exercise demonstrated the benefits of applying AI to manual data queries generated during the data cleaning process. In addition, it offers the opportunity to significantly reduce the number of manual queries during a trial, increasing efficiency and reducing costs.

Automation and AI techniques play a crucial role in managing and distributing clinical trial data. However, while machines may be data-driven and more accurate than manual approaches, human attributes are essential to provide the critical interpretation to understand the data.

Jennifer Bradford, PhD, is the director of data science at Phastar. She previously worked in the Advanced Analytics Group at AstraZeneca, leading the development of the REACT clinical trial monitoring tool, which she later customized and delivered to other sponsors as part of Cancer Research UK (CRUK). Within CRUK, and in close collaboration with the Christie hospital, she worked on EDC, app development and wearables data analytics in the context of clinical trials. She has a degree in biomedical sciences from Keele University and a bioinformatics master's and PhD from Leeds University.

Sheelagh Aird, PhD, is the senior director of clinical data operations at Phastar. With more than 30 years of experience in clinical data management, she has directed and delivered projects in all phases of clinical trials across numerous therapeutic areas and data collection platforms. Sheelagh holds a BSc in pharmacology and a doctorate in pharmacokinetics from the University of Bath. She has led Phastar's Data Operations group since 2016.