P03-20
Generalized Molecular Representation for Drug Discovery via Molecular Graph Latent Diffusion Autoencoder
Daiki KOGE*1, Naoaki ONO2,3, Takashi ABE1, Shigehiko KANAYA2,3
1Graduate School of Science and Technology, Niigata University
2Data Science Center, Nara Institute of Science and Technology (NAIST)
3Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology (NAIST)
( * E-mail: daiki-ko@ie.niigata-u.ac.jp )
In machine-learning-based drug discovery, labeled data for specific molecular properties are limited, so it is essential to construct prediction models with high generalization performance (i.e., high prediction performance on previously unseen data) from limited labeled data. Encoding chemical structures as meaningful numerical latent representations can facilitate the generalization of molecular property prediction. Such molecular representations are designed through feature engineering or representation learning. Although the space of all possible organic compounds is enormous, a molecular representation that generalizes across the entire compound space could accelerate drug discovery. The variational autoencoder (VAE) is a representative deep learning method for constructing molecular representations. Although the VAE is well suited to representing a chemical structure as a single latent vector, the standard normal distribution typically used as the prior over the latent variables oversimplifies the molecular representation and may adversely affect the generalization performance of property prediction.
This study aims to learn a molecular representation that improves the generalization performance of molecular property prediction models, using a deep generative model that integrates a graph transformer autoencoder with a denoising diffusion probabilistic model (DDPM). The proposed model maximizes the marginal likelihood using a smooth, multimodal probability distribution generated by the DDPM as the prior distribution.
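To make the role of the DDPM more concrete, the following is a minimal sketch of the standard DDPM forward (noising) process applied to a latent vector. It is not the authors' implementation: the linear noise schedule, the number of steps, and the names alpha_bar_schedule and q_sample are assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumed values, not taken from the abstract.
    """
    alpha_bars = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I).

    z0 is a hypothetical latent vector (a list of floats) produced by an encoder.
    Returns the noised latent and the Gaussian noise that was added.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, noise)]
    return zt, noise


# Usage: noise a toy 4-dimensional latent at an intermediate step.
rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
z_t, eps = q_sample([0.5, -1.0, 0.25, 2.0], t=500, alpha_bars=alpha_bars, rng=rng)
```

In a DDPM, a network is trained to predict the added noise from the noised latent and the step index; sampling then reverses this process, which is what allows the learned prior to be multimodal rather than a single Gaussian.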
We constructed prediction models for quantum chemical properties such as HOMO energy, physicochemical properties such as solubility, and biochemical properties such as biological activity. We evaluated the generalization performance of our model and several existing models using the widely applicable information criterion (WAIC) and the widely applicable Bayesian information criterion (WBIC). Our method showed higher generalization performance than the existing methods. In addition, it efficiently identified desirable molecules in a chemical structure search experiment using Bayesian optimization with a Gaussian process regression model.
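As an illustration of how WAIC can be estimated from posterior samples, the sketch below uses one common convention (the deviance scale, where lower values indicate better expected generalization). The abstract does not state which convention or estimator the authors used, so the function name waic and the input layout log_lik[s][i] are assumptions made for this example.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """Estimate WAIC from pointwise log-likelihoods.

    log_lik[s][i] is log p(y_i | theta_s) for posterior sample s and data point i
    (an assumed input layout). Returns (waic, lppd, p_waic), where
    waic = -2 * (lppd - p_waic).
    """
    n_samples = len(log_lik)
    n_points = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters (sum of posterior variances)
    for i in range(n_points):
        column = [log_lik[s][i] for s in range(n_samples)]
        lppd += log_mean_exp(column)
        mean = sum(column) / n_samples
        p_waic += sum((v - mean) ** 2 for v in column) / (n_samples - 1)
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Usage: three posterior samples, two data points (toy numbers).
toy_log_lik = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
waic_value, lppd, p_waic = waic(toy_log_lik)
```

The penalty term p_waic grows when the pointwise log-likelihood varies strongly across posterior samples, so a model with a lower WAIC is expected to predict unseen data better.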
The results confirm that our method is effective in constructing a prediction model with high generalization performance from limited data.