O05_01

AlphaFold protein 3D structures enhance genome-wide scale compound-protein interaction prediction with deep learning

Yuga MORIYAMA*1, Sae OKAMOTO2, Tomokazu SHIBATA2, Ryusuke SAWADA3, Yoshihiro YAMANISHI1

1 Graduate School of Informatics, Nagoya University
2 Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology
3 Graduate School of Medical and Dental Sciences, Okayama University
(*E-mail: moriyama.yuga.i8@s.mail.nagoya-u.ac.jp)

Keywords: protein 3D structures, ligand binding site, compound-protein interaction, graph neural network

In the early stages of drug discovery, identifying compounds that regulate disease-related therapeutic target proteins is an important task. However, experimental methods are costly and time-consuming. To address this problem, machine learning plays a key role in compound-protein interaction (CPI) prediction. CPIs are strongly influenced by the 3D structures of proteins; thus, it is desirable to take protein 3D structures into account in CPI prediction. However, most previous studies on CPI prediction have used amino acid sequences as protein features, because only 14% of the proteins encoded in the human genome have fully determined 3D structures [1]. Recently, AlphaFold2 [2] has made it possible to predict protein 3D structures from amino acid sequences on a genome-wide scale, and such comprehensive protein 3D structures are a useful resource for pharmaceutical research. Thus, the use of AlphaFold2 protein 3D structures may enhance the accuracy of CPI prediction.
In this study, we developed a deep learning-based method for genome-wide scale CPI prediction using 3D structures of proteins. First, compound structures were converted to graphs with atoms as nodes and bonds as edges, and the graphs were input to a graph neural network (GNN) to construct feature vectors of compounds. Second, protein structures were converted to three-dimensional interaction (3Di) alphabetic strings based on the distances and angles between amino acids using a discrete variational autoencoder, and the 3Di alphabet, whose 20 states represent the geometrical arrangement of amino acids, was used to construct feature vectors of proteins. In addition, the 3D structures of only the ligand-binding pockets were transformed into graphs with amino acids as nodes and spatial proximities as edges, and the resulting pocket-constrained graphs were input to the GNN to construct feature vectors of proteins. These feature vectors were input to a fully connected neural network to predict CPIs. Our results confirmed that protein structure information was more useful than protein sequence information, and the proposed method achieved the highest accuracy on benchmark datasets. The proposed method is expected to be useful for various applications in drug discovery.
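
The graph construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 8 Å distance cutoff, the toy coordinates, the two-dimensional node features, and the function names (`pocket_graph`, `message_pass`, `readout`) are all assumptions chosen for illustration, and a single round of mean aggregation stands in for a trained GNN. The sketch builds a pocket graph from residue coordinates, builds a small compound graph from bonds, pools each graph into a vector, and concatenates the two vectors as they might be passed to a fully connected network.

```python
import math


def pocket_graph(coords, cutoff=8.0):
    """Link residues whose coordinates lie within `cutoff` (assumed value, in angstroms)."""
    n = len(coords)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def message_pass(features, adj):
    """One round of mean aggregation: each node averages itself and its neighbours."""
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in adj[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


def readout(features):
    """Mean-pool node features into one graph-level vector."""
    return [sum(col) / len(features) for col in zip(*features)]


# Toy pocket: three residues 5 angstroms apart on a line, plus one distant residue.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
# Hypothetical 2-dimensional residue features (illustrative only).
residue_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

adj = pocket_graph(coords)
protein_vec = readout(message_pass(residue_feats, adj))

# Toy compound graph: heavy atoms of ethanol (C-C-O), atoms as nodes, bonds as edges.
compound_adj = {0: [1], 1: [0, 2], 2: [1]}
compound_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot: carbon, oxygen
compound_vec = readout(message_pass(compound_feats, compound_adj))

# Concatenated pair representation; a fully connected network would take this as input.
pair_vec = compound_vec + protein_vec
```

In this sketch the distant fourth residue ends up with no edges, which shows how a proximity cutoff restricts message passing to spatially neighbouring residues. A practical pipeline would replace the toy features with learned embeddings and the averaging step with trained GNN layers.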

[1] Sawada, R. et al. iScience, 27, 6, 110032 (2024).
[2] Jumper, J. et al. Nature, 596, 583–589 (2021).