Innovative drug discovery process through generative AI in a high-tech lab environment, illustrating the potential of advanced technology in medicine.

[Generative AI image from Tahsin/Adobe Stock]

Drug discovery is traditionally labor-intensive, and the computational work that accompanies experimental screening is often extensive. Advances in AI, however, promise to streamline the process. To that end, a team from MIT and Tufts has introduced ConPLex, a computational model that uses large language model techniques similar to those behind ChatGPT. The model analyzes vast amounts of text data to discern patterns and relationships among amino acids. The technique matches potential drug molecules to their target proteins without requiring complex molecular structure computation. The system is efficient enough to screen more than 100 million compounds in a single day.

Bonnie Berger, head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the senior authors of the new study, explained over email how the ConPLex model could be adapted for a wider range of interaction predictions. Berger, joined by her research team, noted, “The current form of ConPLex relies on co-representing the protein and small molecule in a shared high-dimensional (embedding) space, with our model learning a representation scheme that places interacting proteins and drugs close together in this embedding space.”

In the context of machine learning, embeddings are vector representations of a particular data type. They map objects, such as words, proteins or drugs, into vectors of real numbers. Converting such objects into embeddings enables models such as ConPLex to manipulate these objects mathematically.
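
As a rough illustration of what working with embeddings looks like, the sketch below compares a protein vector with two compound vectors using cosine similarity. The three-dimensional vectors and their names are made up for this example; the embeddings ConPLex actually learns are high-dimensional and come from trained models, and cosine similarity is used here only as one common way to measure closeness.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (illustrative values, not real model output).
protein = [0.9, 0.1, 0.3]
binder = [0.8, 0.2, 0.4]        # a compound placed near the protein
non_binder = [-0.5, 0.9, -0.2]  # a compound placed far from the protein

sim_binder = cosine_similarity(protein, binder)
sim_non_binder = cosine_similarity(protein, non_binder)

print(f"protein vs. binder:     {sim_binder:.3f}")
print(f"protein vs. non-binder: {sim_non_binder:.3f}")
```

Once objects are vectors, a question such as "is this compound likely to interact with this protein?" becomes a numerical comparison of the kind shown here.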

From patterns to predictions with ConPLex

A schematic of ConPLex published on GitHub

The researchers designed the system to place interacting proteins and drugs near one another in this shared, high-dimensional space. If additional kinds of molecules can be accurately embedded in the same space (for example, an antibody modeled by AbMAP or a peptide represented by a protein language model), ConPLex could potentially be repurposed for a broader range of interaction predictions.

ConPLex overcomes limitations of previous computational models, which often struggled to distinguish between actual drug compounds and decoys — compounds resembling the drug but not interacting effectively with the target.

In addition, traditional models were error prone, predicting an interaction where there shouldn’t be one. To address that problem, ConPLex incorporates contrastive learning to differentiate between genuine drugs and imposters.

Contrastive learning is a machine learning technique that trains models to separate similar data points from dissimilar ones. The technique enables ConPLex to improve accuracy and efficiency in predicting protein-drug interactions.
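
One common contrastive objective is a triplet margin loss, sketched below as an illustration of the general idea. This is an assumption chosen for the example rather than a statement of the authors' exact formulation; the function names, the margin value and the two-dimensional vectors are all hypothetical. The loss is zero when the true drug sits closer to the target than the decoy by at least the margin, and positive otherwise, which is the signal that pushes decoys away during training.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.25):
    """Penalize cases where the negative is not at least `margin` farther
    from the anchor than the positive is."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical toy embeddings (illustrative values only).
target = [1.0, 0.0]        # anchor: the target protein
true_drug = [0.9, 0.1]     # positive: a genuine binder
decoy_far = [0.0, 1.0]     # negative already far from the target
decoy_near = [0.95, 0.05]  # negative that sits too close to the target

well_separated = triplet_margin_loss(target, true_drug, decoy_far)
confusable = triplet_margin_loss(target, true_drug, decoy_near)

print(f"loss with a distant decoy: {well_separated:.4f}")
print(f"loss with a nearby decoy:  {confusable:.4f}")
```

Only the second case produces a nonzero loss, so only that decoy would contribute a training signal in this toy setup.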

Protein interaction prediction capacity

The ConPLex model draws from a database of more than 20,000 proteins, converting their amino-acid sequences into meaningful numerical representations. This encoding captures the correlation between sequence and structure, improving prediction accuracy.

Another unique aspect of ConPLex is its ability to account for the dynamic nature of proteins and drug molecules, a vital feature for accurately predicting interactions. Berger’s team elaborated, “Rather than explicitly representing the protein 3D structure at an atomic resolution, ConPLex uses an implicit representation of proteins in a high dimensional space using the protein language model (and likewise for the small molecule).”

This representation, they believe, captures not only the standard protein structure but also its conformational flexibility.

“By working with this implicit representation, our machine learning model can thus effectively marginalize over several conformations of the molecule, accounting for conformational flexibility and dynamics,” the team continued. “However, we think there is exciting future work to be done explicitly representing conformational flexibility which could further improve model performance!”

In tests, ConPLex ran on multiple CPUs and one GPU

Berger’s team also shed light on the screening capabilities of ConPLex and its hardware requirements. “We ran our predictions on an Intel server with multiple CPUs, but using only a single NVIDIA A100 GPU.”

NVIDIA designed the data-center grade A100 Tensor Core GPU for deep learning workloads. Based on the Ampere GA100 GPU, the A100 is part of the NVIDIA data center platform and accelerates over 700 HPC applications and every major deep learning framework. Such hardware is common in academic computer science labs and is also available in cloud-computing platforms from vendors like AWS or Azure.

The MIT and Tufts researchers further tested ConPLex by screening a library of about 4,700 candidate drug molecules for their binding ability to a set of 51 protein kinases, a type of enzyme. From the top hits, they selected 19 drug-protein pairs for experimental tests. A total of 12 pairs had strong binding affinity. Four of these pairs exhibited extremely high affinity, suggesting that the drug concentration needed to inhibit the protein would be in the sub-nanomolar range. Such results underscore ConPLex’s potential for large-scale screenings and the identification of strong drug candidates.

Ensuring accessibility for the scientific community

To select test cases for experimental validation, the team conducted an unbiased all-vs-all scan of 51 kinases against 4,715 small molecule drugs. From this set, they selected top predicted kinase/drug pairs based on specific criteria. “We focused on kinases that were predicted to interact with several drugs, selecting five such kinases,” the team noted. “Note that we did this selection without peeking at the labels of the kinases or drugs, and so were completely agnostic to their biological function or prior known interactions.”
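
An all-vs-all scan of 51 kinases against 4,715 drugs means scoring 51 × 4,715 = 240,465 pairs. The sketch below shows the general shape of such a scan on a toy scale: score every pair, rank the pairs, then pick out the kinases that appear several times among the top predictions. Everything here is hypothetical, including the names, the two-dimensional vectors, the cosine-similarity score and the cutoffs; it is not the team's code or their selection criteria.

```python
import math
from collections import Counter
from itertools import product


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def all_vs_all(proteins, compounds):
    """Score every protein/compound pair and return them best-first."""
    scored = [
        (p_name, c_name, cosine_similarity(p_vec, c_vec))
        for (p_name, p_vec), (c_name, c_vec) in product(
            proteins.items(), compounds.items()
        )
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


def proteins_with_several_hits(ranked, top_n, min_hits):
    """Return proteins that appear at least `min_hits` times in the top `top_n` pairs."""
    counts = Counter(p_name for p_name, _, _ in ranked[:top_n])
    return sorted(name for name, n in counts.items() if n >= min_hits)


# Hypothetical toy embeddings (illustrative values only).
kinases = {"kinase_A": [1.0, 0.0], "kinase_B": [0.0, 1.0]}
drugs = {
    "drug_1": [0.9, 0.1],
    "drug_2": [0.8, 0.3],
    "drug_3": [0.2, 0.9],
    "drug_4": [-0.2, -0.9],
}

ranked = all_vs_all(kinases, drugs)
multi_hit = proteins_with_several_hits(ranked, top_n=3, min_hits=2)

for kinase, drug, score in ranked[:3]:
    print(f"{kinase} + {drug}: {score:.3f}")
print("kinases with several top-ranked drugs:", multi_hit)
```

Because the selection uses only scores and opaque identifiers, a procedure of this kind stays agnostic to what each kinase or drug is known to do.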

To date, interest in the ConPLex model has been considerable. “We have been invited by some industry folks to speak about our algorithm,” the team noted. “We have also embarked on a promising collaboration with Dr. Eytan Ruppin of the NIH on exploring drug targets for cancer.”

The researchers have made ConPLex freely accessible to the scientific community. Researchers can install the software from a terminal with the command “pip install conplex-dti” (without the quotation marks). It is also available on GitHub. ConPLex v0.1.0 is now in its pre-release stage and remains in active development.