Early in my career, my manager used the phrase in the above headline to highlight the difficulty inherent in drug discovery. Over the ensuing years, I have seen that statement repeatedly confirmed by the brutal attrition in the discovery and development of new drugs. There are so many variables that can kill a drug discovery project, ranging from target validation and hit generation to off-target effects and formulation challenges, and that’s before even entering the clinic, where a whole new set of attrition factors arises. The number of variables to be simultaneously optimized is immense. One is never quite sure if it is even possible to thread the needle and arrive at a global optimum. It is a testament to the grit and persistence of drug discovery scientists that we have found as many lifesaving drugs as we have.

As a multiparameter optimization problem, drug discovery is perhaps the most challenging example we face. But recent advances in computational power and data science have given the world new tools to tame such complex problems. Artificial intelligence (AI) can be defined as “a system’s ability to correctly interpret … data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.” In recent years, AI has achieved great success in application to many long-standing challenges, such as language translation, image recognition and game playing (e.g., chess or Go). In the biological sciences, one of the most spectacular achievements of AI has been AlphaFold and its ability to predict protein structure from primary sequence.

All these stunning successes have a vast corpus of external data for the AI to “learn” from. In the case of language translation, the corpus was millions of pages of translated text, such as that available from United Nations archives. For AlphaFold, the corpus was the numerous experimentally determined protein structures available in public databases (e.g., the Protein Data Bank or PDB).

How can drug discovery scientists benefit from these powerful approaches? The answer is not obvious for those at smaller companies or those who work on truly innovative and novel drug targets. Particularly for novel targets, vast collections of experimental data just aren’t available. That means that in order to take full advantage of the power of AI, experimental data must be generated de novo. Fortunately, the biological and chemical sciences have made great advances in throughput, scalability and miniaturization in the past 25 years. Techniques such as high-throughput screening (HTS) and affinity selection/mass spectrometry (ASMS) can produce large sets of bioactivity data quickly. But the leading platform for rapid data generation is DNA-encoded library (DEL) technology.

DEL technology has evolved since its inception 20 years ago into a leading method of hit generation in the biopharma industry. Because DEL technology links chemical structure with DNA sequence, it can use the power of next-generation high-throughput DNA sequencing to characterize chemical outcomes. That means that experimenters can routinely generate tens to hundreds of millions of chemistry data points (via DNA sequencing and translation) in a single affinity selection experiment.
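To make the "DNA sequencing and translation" step concrete, here is a minimal Python sketch of how sequencing reads might be turned into chemistry data points. Everything specific in it is an assumption for illustration: the four-base tags, the two-cycle codebook, the building-block names and the enrichment-over-control summary are hypothetical, and real libraries use longer tags and library-specific encoding schemes.

```python
from collections import Counter

# Hypothetical codebook: one dict per synthesis cycle, mapping a DNA tag
# to the building block it encodes. Real tags and schemes will differ.
CYCLE_TAGS = [
    {"ACGT": "amine_1", "TGCA": "amine_2"},
    {"GGAA": "acid_1", "CCTT": "acid_2"},
]
TAG_LEN = 4  # assumed fixed tag length


def decode_read(read):
    """Translate one read into a tuple of building blocks, or None if a tag is unknown."""
    blocks = []
    for cycle, codebook in enumerate(CYCLE_TAGS):
        tag = read[cycle * TAG_LEN:(cycle + 1) * TAG_LEN]
        block = codebook.get(tag)
        if block is None:
            return None
        blocks.append(block)
    return tuple(blocks)


def count_compounds(reads):
    """Count how often each decodable compound appears in a list of reads."""
    counts = Counter()
    for read in reads:
        compound = decode_read(read)
        if compound is not None:
            counts[compound] += 1
    return counts


def enrichment(selected, control, pseudocount=1.0):
    """Ratio of smoothed read frequencies: target selection versus control."""
    compounds = set(selected) | set(control)
    n = len(compounds)
    total_sel = sum(selected.values()) + pseudocount * n
    total_ctl = sum(control.values()) + pseudocount * n
    return {
        c: ((selected.get(c, 0) + pseudocount) / total_sel)
        / ((control.get(c, 0) + pseudocount) / total_ctl)
        for c in compounds
    }


# Toy reads: one selection against a target, one control without it.
target_reads = ["ACGTGGAA"] * 3 + ["TGCACCTT", "NNNNGGAA"]
control_reads = ["ACGTGGAA", "TGCACCTT", "TGCACCTT"]

target_counts = count_compounds(target_reads)
control_counts = count_compounds(control_reads)
ratios = enrichment(target_counts, control_counts)
```

In this toy example the compound built from `amine_1` and `acid_1` is over-represented in the target selection, while the unreadable `NNNN` read is simply discarded. At real scale, the same counting produces one data point per library member per selection.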

At first glance, DEL and AI therefore appear to be a perfect combination. But can DEL data actually drive AI to generate useful predictive models? This topic was explored in a publication by X-Chem and Google in 2020, and the answer was definitely “yes.” Across three different targets, DEL data were fed into AI for model building, and the resulting models were effective at predicting novel active compounds from virtual libraries. These results show that DEL + AI can be effective for hit generation, which can have powerful impacts on drug discovery. More hits mean more opportunities to find chemical matter that can thread that attrition needle and develop into a clinical candidate.
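The general workflow (train on DEL outcomes, then rank a virtual library) can be sketched in a few lines of Python. This is not the method used in the 2020 publication: it swaps in a deliberately simple additive log-odds model, a stripped-down relative of naive Bayes. The substructure feature names, the enriched/not-enriched labels and the compound identifiers are all hypothetical.

```python
import math
from collections import Counter


def train_log_odds(examples, alpha=1.0):
    """Fit per-feature log-odds weights from (feature_set, is_enriched) pairs.

    alpha is a smoothing constant so unseen combinations do not give infinite weights.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for features, enriched in examples:
        if enriched:
            n_pos += 1
            pos.update(features)
        else:
            n_neg += 1
            neg.update(features)
    weights = {}
    for feature in set(pos) | set(neg):
        p_pos = (pos[feature] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg[feature] + alpha) / (n_neg + 2 * alpha)
        weights[feature] = math.log(p_pos / p_neg)
    return weights


def score(features, weights):
    """Sum the weights of the features present; unknown features contribute nothing."""
    return sum(weights.get(f, 0.0) for f in features)


# Hypothetical training data: substructure features per DEL compound,
# labeled by whether the compound was enriched in the selection.
del_examples = [
    ({"sulfonamide", "pyridine"}, True),
    ({"sulfonamide", "indole"}, True),
    ({"sulfonamide", "phenyl"}, True),
    ({"ester", "pyridine"}, False),
    ({"ester", "phenyl"}, False),
    ({"ester", "indole"}, False),
]
weights = train_log_odds(del_examples)

# Hypothetical virtual library of compounds that were never synthesized.
virtual_library = {
    "virt_1": {"sulfonamide", "thiazole"},
    "virt_2": {"ester", "thiazole"},
    "virt_3": {"pyridine", "phenyl"},
}
ranked = sorted(
    virtual_library,
    key=lambda name: score(virtual_library[name], weights),
    reverse=True,
)
```

Here `ranked` puts `virt_1` first because it carries the feature that tracked with enrichment in the training data. A production model would use richer molecular representations and far more data, but the train-then-rank shape is the same.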

At X-Chem, DEL data can serve as a starting point for predictive binding models. But to get the most from the data and AI, we need to layer in additional parameters such as solubility, permeability, brain penetrance, lipophilicity and many others. This allows the AI to focus on target affinity and all the other vital parameters that must be optimized to deliver a promising clinical candidate. Data to support model building around these parameters do exist in publicly available sources, but we have found that the consistency of those data is suspect. Unfortunately, this is common in the biochemical sciences. Much of the literature has proven irreproducible over the years, and it is well known that different labs can get different results for the same experiment (at least initially). Therefore, we have put extensive effort into developing techniques for cleaning and filtering publicly accessible data so that they can have maximal utility in model building.
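One way to picture layering additional parameters on top of predicted affinity is a desirability score that rewards a compound only when every property falls in an acceptable window. The Python sketch below is an illustration under stated assumptions, not a description of any X-Chem tool: the property names, target ranges, margins and the geometric-mean aggregation are all hypothetical choices.

```python
import math


def desirability(value, low, high, margin):
    """Return 1.0 inside [low, high], falling linearly to 0.0 at `margin` outside it."""
    if low <= value <= high:
        return 1.0
    distance = (low - value) if value < low else (value - high)
    return max(0.0, 1.0 - distance / margin)


# Hypothetical target profile: property -> (low, high, margin).
PROFILE = {
    "predicted_affinity": (7.0, 12.0, 2.0),  # assumed pIC50-like scale
    "logP": (1.0, 3.0, 2.0),                 # lipophilicity
    "log_solubility": (-4.0, 0.0, 2.0),      # assumed log molar solubility
}


def mpo_score(compound, profile=PROFILE):
    """Geometric mean of per-property desirabilities; any zero gives an overall zero."""
    scores = [
        desirability(compound[name], low, high, margin)
        for name, (low, high, margin) in profile.items()
    ]
    return math.prod(scores) ** (1.0 / len(scores))


# Hypothetical model predictions for three candidate compounds.
candidates = {
    "cpd_1": {"predicted_affinity": 8.0, "logP": 2.5, "log_solubility": -3.0},
    "cpd_2": {"predicted_affinity": 9.5, "logP": 5.5, "log_solubility": -6.5},
    "cpd_3": {"predicted_affinity": 6.0, "logP": 2.0, "log_solubility": -3.5},
}
ranked = sorted(candidates, key=lambda name: mpo_score(candidates[name]), reverse=True)
```

The geometric mean is chosen here because it mirrors the attrition problem: `cpd_2` has the best predicted affinity of the three, yet it scores zero because its lipophilicity and solubility fall far outside the window, so it ranks last.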

The fruit of these efforts is ArtemisAI, an integrated platform of AI tools specifically designed for use in early-stage drug discovery. Platforms like ArtemisAI are ready-built to harness DEL and other data streams for building models that will accelerate early-stage drug discovery. The synergy between data-hungry AI approaches and data-generating experimental platforms like DEL is evident. While ArtemisAI can utilize DEL data, it also has pre-built models for various ADME parameters that extend its utility beyond hit identification into hit-to-lead and lead optimization. For instance, it can also generate novel compounds, score them against multiple parameters, and interface with physics-based approaches to dock generated compounds in cases where DEL data may be unavailable. The challenge for the next few years will be to apply these techniques to real-world examples of difficult drug discovery problems and show a positive impact. Given the continuing advances in experimental techniques and computing power, we should be able to make drug discovery as simple as rocket science … and not harder.

References

  1. Kaplan, A., et al. Siri, Siri, in my hand: Who’s the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence. Business Horizons 62 (2019), pp. 15-25.
  2. Clark, M., et al. Design, synthesis and selection of DNA-encoded small-molecule libraries. Nat Chem Biol 5 (2009), pp. 647-654. https://doi.org/10.1038/nchembio.211
  3. McCloskey, K., et al. Machine Learning on DNA-Encoded Libraries: A New Paradigm for Hit Finding. J Med Chem 63 (2020), pp. 8857-66. doi: 10.1021/acs.jmedchem.0c0045

Matt Clark, Ph.D., chief executive officer of X-Chem, is a world-recognized innovator and leader in the DNA-encoded library (DEL) field, having led the group responsible for the design and synthesis of early-iteration DELs. He was part of X-Chem’s founding team, and under his scientific leadership, the company developed from a niche chemical discovery platform into a world-leading drug discovery engine serving the biopharma industry.