O03_03
Open Molecule Generator: A Multipurpose Molecule LLM
David JIMENEZ BARRERO *
Elix, Inc.
( * E-mail: david.jimenez@elix-inc.com)
The latest advancements in large language models (LLMs) have opened new opportunities in generative molecular drug design. Modern paradigms in generative modeling have outlined the concept of Foundation Models[1]: large-scale, pre-trained machine learning models that serve as general-purpose tools for a wide range of tasks. These models are typically trained on vast amounts of diverse data and can be fine-tuned, adapted, or extended to perform specific tasks with relatively small amounts of additional data. Furthermore, techniques such as Low-Rank Adaptation (LoRA)[2] make fine-tuning cost-efficient.

In this project, we created a large molecule-oriented Foundation Model, pre-trained on a semantically curated construction of the ChEMBL[3] dataset, so that it can handle downstream tasks requiring physico-chemical, structural, or activity-related information about molecules. To this end, each molecule in the pre-training dataset carries information such as its SMILES string, physico-chemical properties, semantic descriptions of substructures, and known activity against protein targets. The dataset was tokenized and the model was trained on 6.5B tokens using next-token prediction.

Once trained, the model was evaluated on a Multi-Constrained Molecular Generation task, in which it must generate molecules that satisfy up to 26 physico-chemical requirements. The largest Mean Absolute Error (MAE) was observed for Molecular Weight, with a relatively small deviation of 18.07 Da from the target. Other properties of interest, such as logP and the Quantitative Estimate of Drug-Likeness (QED)[4], achieved MAEs of 0.5 and 0.05, respectively. We expect to extend the model's functionality to more specialized tasks via fine-tuning.
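The abstract does not specify how the per-molecule records are serialized for next-token prediction; a minimal sketch, assuming each ChEMBL-derived record is flattened into a single tagged text sequence, might look like the following. All field names and the tag scheme are illustrative, not the actual format.

```python
# Sketch: flatten a curated molecule record into one training string
# for next-token prediction. Field names and tags are hypothetical.

def serialize_record(record: dict) -> str:
    parts = [f"<smiles>{record['smiles']}</smiles>"]
    # Physico-chemical properties, e.g. {"MW": 180.16, "logP": 1.31}.
    for name, value in record.get("properties", {}).items():
        parts.append(f"<prop name={name}>{value}</prop>")
    # Free-text semantic descriptions of substructures.
    for desc in record.get("substructures", []):
        parts.append(f"<sub>{desc}</sub>")
    # Known activity against protein targets.
    for target, activity in record.get("activities", []):
        parts.append(f"<act target={target}>{activity}</act>")
    return "".join(parts)

example = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "properties": {"MW": 180.16, "logP": 1.31, "QED": 0.55},
    "substructures": ["acetyl ester", "benzoic acid"],
    "activities": [("CHEMBL204", "Ki 100 uM")],
}
print(serialize_record(example))
```

Strings of this form would then be tokenized and concatenated into the 6.5B-token pre-training corpus.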
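The MAE evaluation of the Multi-Constrained Molecular Generation task can be outlined in code: assuming the model returns SMILES strings for a prompt encoding the target property values, deviations like the reported 18.07 Da for Molecular Weight would be computed roughly as below. The RDKit descriptor calls are real; the pairing of targets with generations stands in for the actual model output.

```python
# Sketch: score generated molecules against their property constraints.
# RDKit descriptor functions are real; the sample pairs are toy data.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

DESCRIPTORS = {
    "MW": Descriptors.MolWt,
    "logP": Descriptors.MolLogP,
    "QED": QED.qed,
}

def mae_per_property(samples: list) -> dict:
    """samples: (target property dict, generated SMILES) pairs."""
    errors = {name: [] for name in DESCRIPTORS}
    for targets, smiles in samples:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip chemically invalid generations
            continue
        for name, fn in DESCRIPTORS.items():
            if name in targets:
                errors[name].append(abs(fn(mol) - targets[name]))
    return {name: sum(e) / len(e) for name, e in errors.items() if e}

# Toy usage with a single (target, generation) pair.
print(mae_per_property([({"MW": 180.0, "logP": 1.3},
                         "CC(=O)Oc1ccccc1C(=O)O")]))
```

The full evaluation extends this pattern to all 26 physico-chemical requirements.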
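For the planned extension to specialized tasks via LoRA[2], the abstract gives no concrete setup; a minimal sketch using the Hugging Face `peft` library, with GPT-2 as a placeholder base model (the actual architecture is not named), could look like this:

```python
# Sketch: LoRA adapter setup with Hugging Face peft. The base model
# and target module names are assumptions, not the actual architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapters train
```

Because only the low-rank adapter weights are updated, a specialized task can be learned with a small fraction of the compute and memory of full fine-tuning, which is the cost-efficiency the abstract refers to.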
[1] Competition and Markets Authority (2023). AI Foundation Models: Initial Report, https://assets.publishing.service.gov.uk/media/65081d3aa41cc300145612c0/Full_report_.pdf
[2] Edward J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
[3] Anna Gaulton et al., ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012 Jan; 40(Database issue): D1100–D1107.
[4] G. Richard Bickerton et al., Quantifying the chemical beauty of drugs. Nat Chem. 2012 Jan 24; 4(2): 90–98.