O01_03

Machine learning approach to analyze DNA-encoded library screening data for hit identification

Syunya SUZUKI*, Kazuma KAITOH, Yoshihiro YAMANISHI

Graduate School of Informatics, Nagoya University
( * E-mail: suzuki.syunya.h7@s.mail.nagoya-u.ac.jp )

DNA-encoded library (DEL) technology is a recent approach to hit compound screening in which each compound in the library carries a DNA tag whose sequence identifies the compound's structure. A DEL typically comprises millions to billions of compounds, and the technology is expected to reduce the cost and time of hit identification in the pharmaceutical industry. Each compound in a DEL has a central scaffold that is directly linked to the DNA tag, together with associated side-chain structures. Amplifying the DNA tags of compounds that interact with the target protein by PCR and reading them by next-generation sequencing enables the detection of compound-target protein interactions. However, compounds often interact not only with the target protein but also with the matrix on which the target protein is immobilized. At the stage of amplifying and reading the DNA tags, compounds that interact with the target protein cannot be distinguished from those that interact with the immobilizing matrix. Frequent false positive hits are therefore a serious obstacle in DEL screening.

In this study, we propose a machine learning approach to distinguish true positive hits from false positive hits in DEL screening. We constructed a discrimination model to extract the substructures associated with false positive compounds, based on the results of a DEL screen in which the target protein was immobilized on a fixed support and a screen in which the fixed support alone was used. The proposed method successfully separated compounds that interacted with the target protein from those that interacted with the immobilizing matrix, and the Shapley values of the discrimination model helped to extract the substructures involved in the interaction with the target protein and those involved in the interaction with the immobilizing matrix. The proposed approach is expected to be useful for distinguishing false positive hits in DEL screening analysis and for designing DELs consisting of compounds that avoid interactions with the immobilizing matrix.
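The general idea of attributing a discrimination model's output to substructures can be illustrated with a minimal sketch. The abstract does not specify the model type, the compound representation, or how the Shapley values were computed, so everything below is an assumption for illustration only: synthetic binary substructure indicators, a plain logistic regression fitted by gradient descent, and the closed-form Shapley values of a linear model under the assumption of independent features, phi_j = w_j (x_j - E[x_j]), taken on the log-odds scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data (not data from the study): rows are compounds,
# columns are binary indicators for the presence of a substructure.
n_compounds, n_substructures = 400, 6
X = rng.integers(0, 2, size=(n_compounds, n_substructures)).astype(float)

# Assumed labelling: 1 = hit only in the target-plus-support screen,
# 0 = hit in the support-only screen as well (likely false positive).
# By construction, substructure 0 favours target binding and
# substructure 1 favours matrix binding; the rest are noise.
true_w = np.array([2.5, -2.5, 0.0, 0.0, 0.0, 0.0])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n_compounds) < p).astype(float)


def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit logistic regression by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = pred - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def linear_shapley(X, w):
    """Shapley values of a linear model on the log-odds scale,
    assuming independent features: phi_j = w_j * (x_j - mean(x_j))."""
    return (X - X.mean(axis=0)) * w


w, b = fit_logistic(X, y)
phi = linear_shapley(X, w)

# Rank substructures by mean absolute attribution across compounds.
mean_abs = np.abs(phi).mean(axis=0)
ranking = np.argsort(-mean_abs)
for j in ranking:
    direction = "target" if w[j] > 0 else "matrix"
    print(f"substructure {j}: mean |phi| = {mean_abs[j]:.3f} ({direction})")
```

In this toy setting, a positive weight pushes a compound toward the "target" class and a negative weight toward the "matrix" class, so the sign of a substructure's attribution indicates which interaction it is associated with. A nonlinear model would need a general Shapley estimator instead of the closed form used here.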