Navigating generative AI in drug discovery and data analysis: Seizing the opportunity and avoiding pitfalls

Along with predictive AI, generative AI is emerging as a promising tool in drug discovery. Thanks in part to the rise of ChatGPT, interest in the technology in drug discovery is on the upswing. In March, a preprint appeared examining the potential to use generative AI to enable de novo antibody design. Also this year, the Japanese conglomerate Mitsui & Co. began working with NVIDIA to launch Tokyo-1, a project aimed at boosting Japan’s pharma industry with generative AI models. The initiative will give Japanese pharma companies and startups access to an NVIDIA DGX AI supercomputer, providing a shot in the arm to the country’s $100 billion pharma sector, which is the third largest globally.

As generative AI gains ground in pharma, businesses considering using the technology to speed up drug discovery should also take its potential drawbacks into consideration. To that end, Ali Arsanjani, director of Google’s AI/ML division, says that understanding the procedures involved in putting generative AI into practice is key. These procedures include prompt-based prototyping, data generation, programmatic data labeling, model selection, model evaluation and interpretability.

The multifaceted promise of generative AI

In drug discovery, generative AI models’ ability to create new data samples based on the patterns found in their training data can help unveil novel molecular structures or drug candidates. Examples of generative AI include deep learning techniques like generative adversarial networks and variational autoencoders. Conversely, predictive AI uses existing data to make predictions. In pharma, the latter technique could, for instance, identify the most likely drug candidates from a database.
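The distinction can be illustrated with a deliberately tiny sketch. It is not a GAN or a variational autoencoder: it swaps in a simple character-level bigram model to show the generative idea (sampling new strings from patterns in training data) next to the predictive idea (ranking entries that already exist). The training strings, the `toy_score` function and all other names are hypothetical placeholders, and the generated strings are not checked for chemical validity.

```python
import random
from collections import defaultdict

# Hypothetical toy "training set" of SMILES-like strings (illustrative only).
TRAINING = ["CCO", "CCN", "CCCO", "CCOC", "CNC"]

START, END = "^", "$"


def fit_bigram_model(samples):
    """Record which character follows which across the training strings."""
    transitions = defaultdict(list)
    for s in samples:
        chars = [START, *s, END]
        for prev, nxt in zip(chars, chars[1:]):
            transitions[prev].append(nxt)
    return transitions


def generate(transitions, rng, max_len=12):
    """Generative step: sample a new string from the learned transitions."""
    out, current = [], START
    while len(out) < max_len:
        current = rng.choice(transitions[current])
        if current == END:
            break
        out.append(current)
    return "".join(out)


def toy_score(s):
    """Placeholder score (share of O/N characters); a stand-in for a real predictor."""
    return sum(ch in "ON" for ch in s) / len(s)


def rank_existing(candidates, score):
    """Predictive step: score entries already in a database and sort them."""
    return sorted(candidates, key=score, reverse=True)


model = fit_bigram_model(TRAINING)
rng = random.Random(0)  # fixed seed so the sketch is repeatable

generated = [generate(model, rng) for _ in range(5)]  # may include strings not in TRAINING
ranked = rank_existing(TRAINING, toy_score)           # only reorders existing entries

print("generated:", generated)
print("ranked:", ranked)
```

The generative function can emit strings that never appeared in the training set, while the ranking function can only reorder what is already there. A production system would replace both toy components with trained models and would validate any generated structure before use.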

Accelerated drug discovery

Generative AI can predict novel drug candidates, optimize molecular structures and analyze large datasets. A few companies are already seeing significant time savings from the technology. For example, in January, researchers explained that they used the AI-powered protein folding prediction model AlphaFold to discover a novel CDK20 small molecule inhibitor in 30 days, publishing the results in Chemical Science.

Similarly, the German biotech firm Evotec announced a phase 1 clinical trial for a novel anticancer compound it developed with Exscientia, an Oxford-based firm that uses AI for small-molecule drug discovery. By using Exscientia’s ‘Centaur Chemist’ AI design platform, the companies identified the drug candidate in eight months. For reference, the traditional discovery process often takes between four and five years, as Nature reported.

Another example of a firm using generative AI to accelerate drug development comes courtesy of Insilico Medicine, which announced in April that it had discovered a potent, selective and orally available small molecule inhibitor of CDK8 for cancer treatment using a structure-based generative chemistry approach. The company used the Chemistry42 multi-modal generative reinforcement learning platform in the research, which was published in the American Chemical Society’s Journal of Medicinal Chemistry.

Cost savings

By automating some aspects of drug development, generative AI can reduce labor-intensive manual work and human error rates, potentially helping defray drug development costs. McKinsey estimated in 2020 that generative design, a specific application within the broader field of generative AI, could deliver savings of 23% to 38% and cost reductions of 8% to 15%. Such savings are not guaranteed, however. They depend on balancing AI implementation costs against efficiency gains, which can vary between companies and projects.
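The balance between implementation costs and efficiency gains can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the baseline budget and implementation cost are hypothetical figures, and the 8% and 15% rates are borrowed from the cost-reduction range cited above simply as sample inputs.

```python
def net_savings(baseline_cost, reduction_rate, implementation_cost):
    """Savings from the efficiency gain minus what the AI rollout costs."""
    return baseline_cost * reduction_rate - implementation_cost


def break_even_rate(baseline_cost, implementation_cost):
    """Smallest cost-reduction rate at which the rollout pays for itself."""
    return implementation_cost / baseline_cost


# Hypothetical project figures (assumptions, not data from any company).
BASELINE_COST = 10_000_000        # annual spend on the affected workflow, in dollars
IMPLEMENTATION_COST = 1_000_000   # cost of deploying and running the AI tooling

low_case = net_savings(BASELINE_COST, 0.08, IMPLEMENTATION_COST)
high_case = net_savings(BASELINE_COST, 0.15, IMPLEMENTATION_COST)
threshold = break_even_rate(BASELINE_COST, IMPLEMENTATION_COST)

print(f"net savings at 8%:  {low_case:,.0f}")
print(f"net savings at 15%: {high_case:,.0f}")
print(f"break-even reduction rate: {threshold:.0%}")
```

With these made-up inputs, the low end of the range leaves the project in the red while the high end produces a net gain, which is why the same technology can pay off for one project and not for another.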

Improved collaboration

Generative AI also has the potential to foster interdisciplinary collaboration in complex problem-solving. As noted in Nature Biotechnology, generative AI and large language models (LLMs) can bolster productivity for data scientists and engineers. LLMs can also harmonize and standardize data from heterogeneous sources, serving as a framework for integrating diverse data and concepts and fueling collaborative data networks.

Navigating the challenges and opportunities

As the use of generative AI in drug discovery gains momentum, it is also vital to keep in mind the technology’s limitations. These include data privacy and ethical concerns, a lack of standardization, dependence on biased or incomplete data and reliability headaches such as so-called AI hallucination.

Data privacy and ethical concerns

Because generative AI depends on enormous volumes of data, privacy concerns can arise over the potential misuse of sensitive data. Companies considering the use of generative AI should establish explicit policies, engage with regulators and create ethical frameworks to guide its use. Google’s Arsanjani advises using the idea of “responsible AI” as a guiding principle, prioritizing fairness, accuracy, safety and security in AI applications.

Lack of standardization

The lack of rigorous standardized protocols and best practices can hinder the successful implementation of generative AI in drug development. This gap can lead to inconsistent results and make it difficult to compare findings across studies. Organizations exploring generative AI can develop and refine standards by collaborating with researchers, other pharma partners and regulatory bodies. Without such standards, it will remain difficult to develop accurate, reliable and transparent models for generating hypotheses and designing structurally novel molecules. Diverse and representative datasets that capture a range of data sources are also necessary to minimize potential biases.

Dependence on biased or incomplete data

Just as the age-old garbage-in, garbage-out dictum suggests, generative AI models are reliant on the quality of the data they are trained on, as NIST has noted. If the data are biased or incomplete, the models can make flawed predictions. To mitigate that risk, organizations should invest in high-quality data curation, ensuring diversity in data sources and cross-validating with a variety of overlapping datasets. Additionally, organizations should collaborate to develop standardized protocols for data collection and sharing, while also adopting transparent documentation of data processing to minimize biases and improve the reliability of the model.

Hallucination and reliability concerns

Generative AI models such as LLMs can sometimes “hallucinate” or invent facts. This issue played a role in Facebook parent Meta taking its scientific LLM Galactica offline three days after unveiling it in a public demo late last year. Developing augmented language models (ALMs) with improved reasoning and reliability mechanisms may help address this issue, according to Nature Biotechnology. To improve the reliability of generative AI models, drug companies can incorporate expert knowledge and validation processes, such as iterative feedback from domain experts or reinforcement learning with real-world data.