O02_05

Clmpy: A platform for comparable training and structure-generation evaluation of Chemical Language Models

Shumpei NEMOTO*, Yasuhiro YOSHIKAI, Tadahaya MIZUNO, Hiroyuki KUSUHARA

Graduate School of Pharmaceutical Sciences, The University of Tokyo
( * E-mail: nemo88@g.ecc.u-tokyo.ac.jp )

Chemical Language Models (CLMs) are natural language models trained on chemical structures. Since the groundbreaking work of Gómez-Bombarelli et al. in 2016, CLMs have evolved rapidly, taking string-based representations of compounds (such as SMILES and InChI) as input [1]. A key strength of CLMs is their ability to leverage sophisticated natural language processing technologies. By adopting neural machine translation architectures, CLMs enable representation learning of diverse chemical structures without the need for any auxiliary tasks. This capability has been applied to various cheminformatics tasks, such as descriptor generation for Quantitative Structure-Activity Relationship (QSAR) studies. However, the mechanisms by which these models recognize and learn diverse chemical structures remain unclear. Furthermore, many studies construct models in unique environments, leading to a lack of standardized comparison. Here, we aimed to develop a platform that enables the training and comparison of multiple CLMs, leveraging our experience in this field [2, 3]. We implemented four model architectures (GRU, GRU-VAE, Transformer, and Transformer-VAE) with two tokenizers (a conventional tokenizer and Simplified Feature Learning (SFL)), resulting in eight different models [4]. When trained on 30 million compounds obtained from the public compound database ZINC, all models achieved high translation accuracy exceeding 80%, with minimal differences in precision between models. These results demonstrate that our platform enables high-fidelity comparisons across different model architectures. In comparing the constructed models, we evaluated the accuracy rates for structures with chirality. Despite comparable overall translation accuracy, the GRU models outperformed the Transformer models in this respect. This result aligns with previous reports suggesting that vanilla Transformers have a weakness in recognizing chirality [3].
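As a concrete illustration of what the "conventional tokenizer" above does, the sketch below splits SMILES strings into model-input tokens using a regular expression commonly seen in the CLM literature. The pattern and function names here are illustrative assumptions for this abstract, not Clmpy's actual implementation.

```python
import re

# Regex covering multi-character SMILES tokens: bracket atoms (e.g. [C@@H]),
# two-letter elements (Cl, Br), two-digit ring closures (%10), plus all
# remaining single-character symbols. Alternatives are ordered so that
# longer tokens are matched before their single-character prefixes.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFIbcnops]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is left over."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not fully tokenize: {smiles!r}")
    return tokens

# Aspirin: every token happens to be a single character.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# L-alanine: the chiral bracket atom [C@@H] is kept as a single token,
# so the model must infer chirality from one symbol in context.
print(tokenize_smiles("N[C@@H](C)C(=O)O"))
```

Note that stereochemistry is compressed into single tokens such as `[C@@H]`, `/`, and `\`, which makes chirality a sparse signal in the token stream and motivates the chirality-focused evaluation described above.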
Aiming for public and free accessibility, we have packaged the platform as a Python module. This package allows users to specify arbitrary parameters for training in their own environments, facilitating model comparisons under various conditions. It supports both interactive environments like Jupyter notebooks and command-line execution.
This research enables high-precision inter-model comparisons of CLMs based on neural machine translation. We anticipate that this will deepen our understanding of chemical structure recognition by CLMs and contribute to advancements in in silico drug discovery. Future work will utilize this platform to compare the predictive accuracy of downstream tasks (such as toxicity prediction) across different CLMs and to analyze differences in the learning processes of each CLM.

[1] Gómez-Bombarelli R. et al., ACS Cent. Sci., 2018, 4, 268-276
[2] Nemoto S. et al., J. Cheminform., 2023, 15, 45
[3] Yoshikai Y. et al., Nat. Commun., 2024, 15, 1197
[4] Lin X. et al., Brief. Bioinform., 2020, 21(6), 2099-2111