O05_04

Learning the Language of Life: Feasibility of Using LLMs to Understand Latent Characteristics of Proteins from Residue Structural Environments

Nina HOLSMOELLE*, Kenji MIZUGUCHI, Gert-Jan BEKKER

Laboratory for Computational Biology, Institute for Protein Research, Osaka University
( * E-mail: n.holsmoelle@gmail.com)

In this project, we explored the feasibility of using Large Language Models (LLMs) to extract knowledge such as dynamic structural information from static protein structures. To uncover latent characteristics inherent within static configurations, we combined machine learning techniques with one-dimensional representations of local protein structures, assuming an inherent structural logic that is learnable.

We categorized the local environments of amino acid residues, constructed datasets from the Protein Data Bank (PDB), and trained a Masked Language Model (MLM) for feature learning. Analysis of the trained model’s high-dimensional residue representations indicates that the model acquired a partial understanding of the amino acids’ characteristics and structural properties.
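
As an illustration of the masked-language-model objective, the following minimal Python sketch hides a fraction of the residue-environment tokens in a chain and records the hidden tokens as prediction targets. The abstract does not specify the token format or the masking scheme, so the token strings (residue, secondary structure, burial), the 15% masking rate, and the function name `mask_tokens` are all assumptions made for this example, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # placeholder token that hides a residue environment


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens for masked-language-model training.

    Returns (masked, labels): `masked` is the input the model would see,
    `labels` holds the original token at masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position, and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Hypothetical one-dimensional encoding of a short chain: one token per
# residue, describing its local structural environment.
chain = [
    "MET|coil|exposed",
    "ALA|helix|buried",
    "LEU|helix|buried",
    "LYS|helix|exposed",
    "GLY|coil|exposed",
    "VAL|strand|buried",
]
masked, labels = mask_tokens(chain)
```

During training, the model would be asked to predict each entry of `labels` that is not None from the surrounding unmasked tokens in `masked`, which is what pushes it to learn regularities between neighbouring environments.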

The goal of this research is to advance our understanding of protein interactions and functions, which has significant implications for medical and health-related applications such as drug design and disease treatment. With our findings, we hope to introduce a novel methodology for integrating static and dynamic protein data, paving the way for innovations in protein modelling and biomedical applications.