P03-18
Development of an Integrated Machine Learning Model Incorporating Compound-Protein Information for Design and Prediction of Small-Molecule Modulators of PPIs
Tsubasa NAGAE*1,2, Kohei SODA2,3, Kazuyoshi IKEDA4,5, Masashi TSUBAKI2, Kentaro TOMII1,2,3
1 Graduate School of Medical Life Science, Yokohama City University
2 Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology
3 Graduate School of Frontier Sciences, The University of Tokyo
4 Center for Computational Science, RIKEN
5 Faculty of Pharmacy, Keio University
( * E-mail: w235434d@yokohama-cu.ac.jp )
Background
Protein-protein interactions (PPIs) have emerged as promising targets in drug discovery, offering the potential to address diseases previously considered undruggable. PPI modulators, which include both inhibitors and stabilizers, often have properties that diverge from those of traditional drug-like molecules. This divergence creates both opportunities and challenges in drug development.
Because of these differences, indicators and computational methods developed for conventional drug targets are often inapplicable to PPI modulators, which has led to the development of indicators specific to PPI-targeting compounds. However, these PPI-focused indicators have significant limitations: they mainly emphasize inhibitors, often struggle to evaluate stabilizers accurately, and frequently do not incorporate information about the specific protein targets.
Given these challenges, the field of small-molecule modulators of PPIs urgently requires novel computational approaches for more effective design and prediction of both inhibitors and stabilizers.
Methods
We constructed a novel dataset of PPI stabilizers and inhibitors that includes information on their target protein pairs. Data were collected through database mining and literature review. Each entry is a triplet: the compound, represented as a SMILES string, and the amino acid sequences of the two proteins in the target pair. For stabilizers, where existing databases were insufficient, we augmented the dataset with candidates extracted from the Protein Data Bank (PDB) based on structural information of protein-ligand-protein complexes.
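The triplet format described above can be sketched as a small record type. This is only an illustration: the class name, field names, and the order-independent deduplication key are assumptions made here, not details reported in the abstract, and the SMILES string and sequences in the usage note are toy placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPIModulatorEntry:
    """One dataset entry: a compound plus its target protein pair (hypothetical schema)."""
    smiles: str          # compound as a SMILES string
    protein_a_seq: str   # amino acid sequence of the first target protein
    protein_b_seq: str   # amino acid sequence of the second target protein
    label: int           # assumed encoding: 1 = stabilizer, 0 = inhibitor

    def key(self):
        # Treat (A, B) and (B, A) as the same protein pair (an assumption).
        pair = tuple(sorted((self.protein_a_seq, self.protein_b_seq)))
        return (self.smiles, pair)


def deduplicate(entries):
    """Keep the first occurrence of each (compound, protein pair) combination."""
    seen = set()
    unique = []
    for entry in entries:
        k = entry.key()
        if k not in seen:
            seen.add(k)
            unique.append(entry)
    return unique
```

For example, `deduplicate([PPIModulatorEntry("CCO", "MKT", "GAV", 1), PPIModulatorEntry("CCO", "GAV", "MKT", 1)])` would keep a single entry, which may be useful when the same complex is found in both a database and the PDB.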
In our machine learning model, we treated stabilizer entries as positive examples and inhibitor entries as negative examples. We primarily used representations generated by compound language models and protein language models as input features to capture the characteristics of PPI modulators.
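A minimal sketch of this setup follows, assuming the compound and protein representations are fixed-length vectors that are simply concatenated and passed to a logistic regression classifier. The abstract does not specify the language models, the way the representations are combined, or the classifier, so every choice here is an assumption; the embeddings are random synthetic stand-ins, not outputs of any real model.

```python
import math
import random


def build_features(compound_vec, protein_a_vec, protein_b_vec):
    """Concatenate one compound embedding with the two protein embeddings."""
    return list(compound_vec) + list(protein_a_vec) + list(protein_b_vec)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Probability that x is a positive example (stabilizer)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Synthetic stand-in embeddings (NOT real language-model outputs):
# the compound vectors carry a class-dependent shift, the protein vectors do not.
rng = random.Random(0)


def toy_embedding(dim, shift):
    return [rng.gauss(shift, 0.1) for _ in range(dim)]


X, y = [], []
for label in (1, 0):  # 1 = stabilizer (positive), 0 = inhibitor (negative)
    shift = 1.0 if label == 1 else -1.0
    for _ in range(20):
        X.append(build_features(toy_embedding(4, shift),
                                toy_embedding(3, 0.0),
                                toy_embedding(3, 0.0)))
        y.append(label)

w, b = train_logistic(X, y)
accuracy = sum((predict_proba(w, b, x) >= 0.5) == bool(t)
               for x, t in zip(X, y)) / len(y)
```

In practice the toy vectors would be replaced by embeddings of the SMILES string and the two sequences, and accuracy would be measured on held-out data rather than on the training set as done here.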
Results
We constructed a dataset containing over 4,000 triplet entries. Statistical analysis of the data revealed a slight bias towards negative examples (inhibitors), which we addressed by curating a balanced dataset for machine learning. We then built a machine learning model to classify PPI stabilizers and inhibitors, and its evaluation showed that the two classes could be distinguished with satisfactory accuracy.
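One common way to obtain a balanced dataset is random undersampling of the majority class, sketched below. The abstract does not state how the balanced dataset was curated, so this is an assumed technique for illustration only; the function name and the `label_of` parameter are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_undersampling(entries, label_of, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for entry in entries:
        by_label[label_of(entry)].append(entry)

    n_per_class = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced
```

With three positive and five negative entries, for instance, the result would contain three of each. Undersampling discards data, so reweighting the classes during training would be an alternative worth considering when the bias is only slight.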
This study suggests a new approach to designing small-molecule modulators of PPIs. Our model, which incorporates both compound and target protein information, may help deepen our understanding of PPI modulator prediction and design.