Machine Learning Algorithm for Predicting Subcellular Localization Sites for Proteins
DOI: https://doi.org/10.5281/zenodo.14302877
Abstract
Predicting the location of a protein within the cell can help elucidate its function and deduce its involvement in particular biochemical pathways. This study investigates machine learning models for predicting protein subcellular localization sites. The aim is to develop an algorithm that accurately predicts these sites, which helps in identifying healthy cells and is crucial for understanding protein functions, their roles in various biochemical pathways, the onset of disease, and the potential of proteins as drug targets. Two models were explored: Logistic Regression, a statistical model suited to binary classification that estimates the probability of a localization from the input features, and K-Nearest Neighbors (KNN), a non-parametric method that classifies a protein by the majority label of its nearest neighbors in the feature space. A comprehensive dataset containing protein sequences and their corresponding subcellular localization labels was curated. Relevant features were extracted from the protein sequences, and the dataset was divided into training and testing sets. The models were trained on the training set and evaluated on the testing set. Performance was assessed using several key metrics: classification accuracy, F-score, precision, and recall, all derived from the confusion matrix. KNN achieved the highest accuracy, 98%, with a precision of 100%, indicating that it classified almost all instances correctly and produced no false positives. Logistic Regression reached a classification accuracy of 92%, with precision and recall of 96%. Although it performed well, it was less effective than KNN in this context.
The confusion matrix provided insight into model performance by revealing the rates of true and false positives, which are crucial for understanding each model's strengths and weaknesses. The findings suggest that KNN is the more suitable model for predicting protein subcellular localization sites, offering higher accuracy and precision than Logistic Regression.
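As an illustration of the evaluation described in the abstract, the following minimal sketch shows how a KNN majority vote and the confusion-matrix metrics (accuracy, precision, recall, F-score) can be computed. It is not the study's implementation: the two-dimensional feature vectors, the binary labels, the choice of k = 3, and the function names `knn_predict` and `binary_metrics` are all hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F-score from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Hypothetical toy data: two features per protein, label 1 = target site.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.12, 0.18), (0.88, 0.86), (0.5, 0.45), (0.7, 0.75)]
test_y = [0, 1, 1, 1]

y_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
metrics = binary_metrics(test_y, y_pred)
print(y_pred)
print(metrics)
```

On this toy data the classifier makes no false-positive predictions but misses one positive, so precision is 100% while recall is lower. This is why a precision of 100% should be read as "no false positives" rather than "every positive was found".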