O09_04

Designing an Information Infrastructure for the Integration and Utilization of Multimodal Bioactivity Information

Shuichiro MAKIGAKI*, Mayumi KAMADA

Faculty of Future Engineering, Kitasato Institute
(*E-mail: makigaki.shuichiro@kitasato-u.ac.jp)

In data science and artificial intelligence research and development, data preparation and processing are the starting point and a crucial determinant of the accuracy of subsequent data analysis and predictive models. In drug discovery, researchers are increasingly called upon to integrate their proprietary compound and bioactivity data with existing large-scale databases.

However, integrating databases poses many challenges, especially when they include natural compounds with complex structures and properties. Measurement data on compounds are highly diverse, and compound structures and properties differ depending on the synthetic process, which makes them difficult to manage in a unified format. Much of this multimodal data remains unintegrated because of its multidimensional and multi-layered nature, and its comprehensive utilization has yet to be fully realized. Moreover, records from different datasets can be linked only when they share unique identifiers. Compounds from private libraries or newly synthesized compounds that are absent from existing databases have no such shared identifier and therefore cannot be linked.
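The identifier-based linking limitation can be illustrated with a minimal sketch, assuming records are keyed by a shared structure identifier. All records, field names, and key strings below are hypothetical placeholders for illustration, not data from any real database.

```python
# Hypothetical records keyed by a shared identifier.
# The keys are placeholder strings, not real structure identifiers.
public_db = {
    "KEY-AAA": {"name": "public compound A", "target": "T1"},
    "KEY-BBB": {"name": "public compound B", "target": "T2"},
}
private_library = {
    "KEY-AAA": {"assay": "in-house assay 1", "ic50_nM": 120.0},
    "KEY-NEW": {"assay": "in-house assay 1", "ic50_nM": 45.0},  # newly synthesized
}


def link_by_identifier(private, public):
    """Join two datasets on exact identifier match.

    Returns (linked, unlinked): merged records for shared keys, and the
    sorted list of private keys that have no counterpart in the public set.
    """
    linked = {k: {**public[k], **v} for k, v in private.items() if k in public}
    unlinked = sorted(k for k in private if k not in public)
    return linked, unlinked


linked, unlinked = link_by_identifier(private_library, public_db)
print("linked:", linked)
print("unlinked:", unlinked)
```

In this toy setting the newly synthesized compound falls into `unlinked`, which is the gap that an exact-match join alone cannot close.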

This study proposes an approach to integrating such complex compound data. We first use structural similarity clustering and common substructure alignment to identify and evaluate the processes by which structures change. We then integrate databases by considering the fragment insertion, substitution, and deletion relationships between compounds, together with their specific substructure relationships. To represent these complex relationships, we employ the Resource Description Framework (RDF). RDF is a graph data model and a component of Semantic Web technology, and its adoption is also progressing in life science databases. As an application example, we demonstrate database integration by combining a subset of NPAtlas with ChEMBL using the proposed approach. We will present specific examples of the kinds of searches that can be performed and discuss the utility of the resulting database and the effectiveness of the proposed approach.
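As a rough illustration of how insertion, substitution, and deletion relationships might be expressed as RDF-style triples, the following sketch stores (subject, predicate, object) tuples in plain Python and serializes them as N-Triples. The `http://example.org/` namespace, the predicate names, and the compound and fragment IRIs are all hypothetical; they are not the vocabulary or data used in this study, and a real implementation would likely use an RDF library and a SPARQL endpoint instead.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each triple is (subject IRI, predicate IRI, object IRI).
# Predicates and resources are assumed names, not an established vocabulary.
triples = {
    (EX + "cmpd/private_1", EX + "derivedBySubstitution", EX + "cmpd/public_A"),
    (EX + "cmpd/private_1", EX + "hasSubstructure", EX + "frag/core_1"),
    (EX + "cmpd/public_A", EX + "hasSubstructure", EX + "frag/core_1"),
    (EX + "cmpd/private_2", EX + "derivedByInsertion", EX + "cmpd/public_A"),
    (EX + "cmpd/public_B", EX + "derivedByDeletion", EX + "cmpd/public_A"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def to_ntriples(store):
    """Serialize IRI-only triples in N-Triples syntax, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(store))


# Which compounds are related to public_A by any edit or substructure link?
neighbors = [s for s, _, _ in match(triples, o=EX + "cmpd/public_A")]

# Which compounds share the hypothetical core fragment?
sharing_core = [
    s for s, _, _ in match(triples, p=EX + "hasSubstructure", o=EX + "frag/core_1")
]

print(neighbors)
print(sharing_core)
print(to_ntriples(triples))
```

The point of the sketch is that a private compound with no shared identifier can still be reached from a public entry by following relationship edges, which is the kind of search a graph model makes straightforward.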

Through this study, we will formulate a new data model for integrating multimodal bioactivity data with public databases and for inter-modal collaboration. By integrating original data with existing databases, we aim to make individual activity information usable in ways that were difficult to achieve in traditional chemical biology, leading to the expansion of the compound latent space.