O09_03
Development of a sustainable database for middle molecules using AI-driven data curation
Kazuyoshi IKEDA *1, 2, Tomoki YONEZAWA2, Masanori OSAWA2, Tsubasa NAGAE3, Kentaro TOMII3
1RIKEN Center for Computational Science, RIKEN
2Faculty of Pharmacy, Keio University
3Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology
(*E-mail: kazuyoshi.ikeda@riken.jp)
Middle molecules (medium-sized molecules), including peptides, nucleic acids, and other medium-sized synthetic compounds, have attracted considerable attention as promising leads for drug targets that are difficult to address with conventional small-molecule drug discovery. These molecules have unique binding properties and are considered favorable candidates for modulating protein-protein interactions (PPIs). We aim to construct an interaction database of middle molecules (peptides, non-peptides, and nucleic acids) that accumulates information on targets that are difficult to address in drug discovery. In our database, the target interaction sites of middle molecules can be identified from ligand binding site similarity data, and our AI technologies can predict interactions between targets and ligands with high accuracy.
We have taken a multifaceted approach to systematically collect and curate middle-molecule data from public databases. These data cover diverse classes of middle molecules, such as cyclic peptides, oligonucleotides, and peptidomimetics. In addition, we supplement the interaction data with results from screening our medium-sized compound library at Keio University against PPI targets.
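As an illustration only, integrating records from several public sources might be sketched as follows. This is not the pipeline used in this study: the `Record` fields, the use of a structure key (for example, an InChIKey) for deduplication, and the 500 to 2000 Da molecular-weight window are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One compound entry from a source database (hypothetical schema)."""
    key: str           # assumed structure key, e.g. an InChIKey
    name: str
    mol_weight: float  # molecular weight in Da
    sources: set = field(default_factory=set)


def merge_records(records, min_mw=500.0, max_mw=2000.0):
    """Keep records inside the assumed size window and merge duplicates.

    Records sharing the same key are collapsed into one entry whose
    `sources` set is the union of the contributing databases.
    """
    merged = {}
    for rec in records:
        # Skip compounds outside the assumed medium-size window.
        if not (min_mw <= rec.mol_weight <= max_mw):
            continue
        if rec.key in merged:
            merged[rec.key].sources |= rec.sources
        else:
            # Copy the source set so the input records are not mutated.
            merged[rec.key] = Record(
                rec.key, rec.name, rec.mol_weight, set(rec.sources)
            )
    return merged
```

Keeping the union of source names on each merged entry preserves provenance, which makes later manual checks of a curated record easier.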
In this study, we developed an AI-based automatic data curation method that efficiently retrieves and integrates data from the literature. In particular, we sought to improve efficiency by applying large language models (LLMs, such as GPT) to the chemical curation of compounds. We also developed a protocol that predicts interactions between compounds and targets and allows them to be curated interactively. Although these technologies currently have accuracy limitations, we will discuss how useful they are for improving efficiency compared with conventional methods.
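One way to picture interactive curation of predicted interactions is a simple triage step in which only uncertain predictions are passed to a human curator. The sketch below is a minimal illustration under assumed conditions, not the protocol developed in this study: the `(compound, target, score)` tuple layout and both thresholds are hypothetical.

```python
def triage(predictions, accept_at=0.9, reject_below=0.5):
    """Split predicted interactions into accept / review / reject queues.

    `predictions` is an iterable of (compound, target, score) tuples,
    where `score` is an assumed confidence value in [0, 1].
    """
    queues = {"accept": [], "review": [], "reject": []}
    for compound, target, score in predictions:
        if score >= accept_at:
            queues["accept"].append((compound, target))
        elif score < reject_below:
            queues["reject"].append((compound, target))
        else:
            # Mid-confidence predictions go to a human curator.
            queues["review"].append((compound, target))
    return queues
```

Under this design the thresholds control how much manual effort is spent: widening the review band sends more predictions to the curator, which trades efficiency for accuracy.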