O06_01

3D structure-based chemical foundation model to predict the bioactivity and toxicity

Tsuyoshi KIMURA *, Yoshihiro YAMANISHI

Graduate School of Informatics, Nagoya University
( * E-mail: kimura.tsuyoshi.p9@s.mail.nagoya-u.ac.jp )

Identifying drug candidate compounds with desired properties is extremely challenging [1]. The number of possible organic compounds is estimated to exceed 10⁶⁰; thus, experimental validation of all of them is infeasible in terms of time and cost. Machine learning approaches play a key role in compound screening. However, labeled datasets of compounds with known properties are small in most cases, which makes it difficult to train machine learning models that generalize well. To address this problem, foundation models have been receiving much attention. Most previously developed foundation models use self-supervised pre-training on large unlabeled datasets of SMILES strings or molecular graphs, followed by fine-tuning on smaller labeled datasets. However, molecular properties such as bioactivity and toxicity depend heavily on the 3D structures of compounds; thus, foundation models that ignore these 3D structures have difficulty predicting such properties accurately.
In this study, we propose a 3D structure-based chemical foundation model to predict various molecular properties (e.g., bioactivity and toxicity) of drug candidate compounds. Our foundation model is pre-trained on 3D structures of compounds obtained with a structure optimization technique that is more accurate than RDKit, and is then fine-tuned for various molecular property prediction tasks. The structure optimization uses supervised learning to match randomly generated 3D conformations to labeled conformations [2]. The pre-training task uses self-supervised learning: the inputs are 3D structures with randomly missing atoms, and the full structures are reconstructed from the 3D coordinates of the remaining atoms. In our experiments, we confirmed that the proposed method performed as well as or better than existing methods, even with a much smaller dataset. The proposed model is expected to reduce the required computational resources and to be applicable to complex molecular property prediction tasks.
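The masked-atom pre-training task described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_atoms` and `coordinate_mse`, the `mask_ratio` value, and the use of a mean squared coordinate error as the reconstruction objective are all assumptions made for illustration. The sketch only shows how a conformer might be split into visible atoms (model input) and held-out atoms (reconstruction targets).

```python
import random
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]
Atom = Tuple[str, Coord]  # (element symbol, xyz coordinates)


def mask_atoms(
    atoms: Sequence[Atom],
    mask_ratio: float = 0.15,  # assumed value, not taken from the abstract
    seed: Optional[int] = None,
) -> Tuple[List[Atom], List[Atom]]:
    """Split a conformer into visible atoms and randomly held-out atoms."""
    if len(atoms) < 2:
        raise ValueError("need at least two atoms to mask one and keep one")
    rng = random.Random(seed)
    # Mask at least one atom, but always leave at least one atom visible.
    n_mask = max(1, min(len(atoms) - 1, round(len(atoms) * mask_ratio)))
    masked_idx = set(rng.sample(range(len(atoms)), n_mask))
    visible = [a for i, a in enumerate(atoms) if i not in masked_idx]
    targets = [a for i, a in enumerate(atoms) if i in masked_idx]
    return visible, targets


def coordinate_mse(pred: Sequence[Coord], target: Sequence[Coord]) -> float:
    """Mean squared error over all coordinates (hypothetical reconstruction loss)."""
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and equal in length")
    total = sum(
        (p - t) ** 2
        for p_xyz, t_xyz in zip(pred, target)
        for p, t in zip(p_xyz, t_xyz)
    )
    return total / (3 * len(target))


if __name__ == "__main__":
    # Toy coordinates for illustration only; not a real optimized conformer.
    toy = [
        ("C", (0.0, 0.0, 0.0)),
        ("C", (1.5, 0.0, 0.0)),
        ("O", (2.1, 1.2, 0.0)),
        ("H", (-0.5, 0.9, 0.0)),
        ("H", (-0.5, -0.9, 0.0)),
    ]
    visible, targets = mask_atoms(toy, mask_ratio=0.4, seed=0)
    print(len(visible), "visible atoms,", len(targets), "masked atoms")
```

In an actual pre-training loop, a model would take the visible atoms as input, predict the identities and coordinates of the masked atoms, and be trained to minimize a loss such as the one sketched here; the abstract does not specify the model architecture or the loss.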
[1] Regine Bohacek, et al. Med. Res. Rev., 1996.
[2] Gengmo Zhou, et al. The Eleventh International Conference on Learning Representations, 2023.