P05-02

Enhancing the Reliability of Machine Learning Predictions through Quantitative Evaluation of the Applicability Domain: A Case Study of Multi-Task Prediction Model of Unbound Fraction in Human, Mouse, and Rat Plasma

Yuki DOI *, Harutoshi KATO, Akira SASAKI

DMPK Research Laboratories, Mitsubishi Tanabe Pharma Corporation
( * E-mail: doi.yuuki@ma.mt-pharma.co.jp )

Drug development is inherently time-consuming and costly, so machine learning (ML) models that predict key outcomes could reduce the time and cost of each stage. The accuracy of ML models (referred to as “activity models” in this study) is typically validated on test datasets during model development, but assessing the reliability of individual predictions in real-world use remains challenging. Inappropriate use of activity models can lead to erroneous decisions, undermining trust in the models and narrowing their range of application. The objective of this study is to develop an “error model” that estimates the reliability of an activity model’s output from metrics reported to correlate with prediction reliability (domain applicability metrics; DA metrics), thereby enhancing the reliability of ML predictions.

The activity model was a multi-task deep learning model that predicted the unbound fraction in human, mouse, and rat plasma. The DA metrics employed included Similarity, Local Error, and PREDICTED, as reported in the literature [1, 2]. Using these DA metrics as inputs, a Random Forest error model was trained to classify whether the prediction error would be within two-fold. The probability output by the error model that the prediction error is within two-fold is referred to as the Confidence Score. The Confidence Score was then compared with the actual prediction error of the activity model. Furthermore, the impact of each DA metric on the Confidence Score was analyzed using SHAP (SHapley Additive exPlanations).
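
The two-fold criterion that serves as the error model's classification target can be illustrated with a minimal sketch. The function name `within_two_fold`, the inclusive boundary, and the example values are assumptions made for illustration; they are not taken from the study.

```python
import math

def within_two_fold(predicted, observed, fold=2.0):
    """Return True if predicted/observed lies within [1/fold, fold].

    Hypothetical helper: the boundary is treated as inclusive here,
    which is an assumption, not a detail reported in the abstract.
    """
    if predicted <= 0 or observed <= 0:
        raise ValueError("unbound fractions must be positive")
    # Compare on the log scale so over- and under-prediction are symmetric.
    # A tiny tolerance guards against floating-point rounding at the boundary.
    return abs(math.log10(predicted / observed)) <= math.log10(fold) + 1e-12

# Made-up (predicted, observed) unbound-fraction pairs, not study data.
pairs = [(0.10, 0.15), (0.02, 0.05), (0.50, 0.26)]
labels = [within_two_fold(p, o) for p, o in pairs]
```

Labels of this kind, paired with the DA metrics of each compound, would form the training set for a classifier such as a Random Forest.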

For compounds with a Confidence Score below 0.5, the proportion of predictions within two-fold error was less than 50%. In contrast, for compounds with a Confidence Score above 0.5, the proportion was 75% or greater. Additionally, as the Confidence Score threshold for including compounds in the accuracy calculation was raised, R² increased and RMSE decreased. SHAP analysis revealed that higher Similarity metrics and lower Local Error metrics were associated with higher Confidence Scores.
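
The threshold analysis can be sketched as follows: keep only compounds whose Confidence Score meets a cutoff, then recompute the accuracy metrics on that subset. This is a rough illustration that assumes errors are evaluated on log10-transformed unbound fractions; the function `evaluate_above_threshold` and all values are hypothetical and are not results from the study. R² could be computed on the same subset in the same way.

```python
import math

def evaluate_above_threshold(records, threshold):
    """Summarize accuracy for compounds with confidence >= threshold.

    records: list of (confidence, predicted_log10, observed_log10) tuples.
    Returns None when no compound passes the threshold.
    """
    kept = [(p, o) for c, p, o in records if c >= threshold]
    if not kept:
        return None
    n = len(kept)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in kept) / n)
    # On the log10 scale, "within two-fold" means |error| <= log10(2).
    frac = sum(abs(p - o) <= math.log10(2) for p, o in kept) / n
    return {"n": n, "rmse": rmse, "within_two_fold": frac}

# Made-up records for illustration only (not study data).
records = [
    (0.9, -1.00, -1.05),
    (0.8, -0.50, -0.40),
    (0.6, -1.50, -1.20),
    (0.4, -2.00, -1.30),
    (0.3, -0.30, -0.90),
]
all_compounds = evaluate_above_threshold(records, 0.0)
confident_only = evaluate_above_threshold(records, 0.5)
```

In this toy data the RMSE of the retained subset falls as the threshold rises, mirroring the trend described above.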

These findings indicate that the Confidence Score is a valuable tool for enhancing the reliability of the activity model’s predictions and for maximizing their appropriate application in drug development.

[1] Sheridan, R. P., Using random forest to model the domain applicability of another random forest model. J Chem Inf Model 2013, 53, 2837-50.
[2] Sheridan, R. P., The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity. J Chem Inf Model 2015, 55, 1098-107.