Machine learning classification of Plasmodium falciparum virulence genes using genomic differentiation scores and boosting algorithms
DOI:
https://doi.org/10.38029/babcockuniv.med.j..v9i1.1226
Keywords:
Plasmodium falciparum, Machine Learning, Virulence Genes, Genomic Differentiation, LightGBM, SHAP, Bioinformatics
Abstract
Objective: This study aims to identify virulence-associated genes in Plasmodium falciparum by applying machine learning models to genomic differentiation features, to aid in the discovery of novel therapeutic targets.
Methods: We used a dataset of 5,561 P. falciparum genes, each labelled by membership in a known virulence gene family (VAR, RIF, EPF, RESA). Three genomic differentiation scores (Global Differentiation, Local Differentiation, and Distance to Higher Local Differentiation) served as input features. We evaluated five classifiers: Random Forest, Gradient Boosting, Support Vector Machine, XGBoost, and LightGBM. To handle class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied strictly within stratified 5-fold cross-validation folds, alongside hyperparameter tuning. Performance was assessed using accuracy, precision, recall (sensitivity), F1-score, and Area Under the Precision-Recall Curve (AUC-PR).
Results: LightGBM achieved the highest performance with a test accuracy of 85.14% ± 1.2% and an AUC-PR of 0.87 ± 0.02, significantly outperforming the next best model, XGBoost (p = 0.018). Feature importance analysis via SHAP (Shapley Additive Explanations) identified Local Differentiation Score as the most predictive feature.
Conclusion: Boosting algorithms, particularly LightGBM, are highly effective for classifying virulence genes based on genomic differentiation patterns. This approach provides a scalable, data-driven method for prioritising candidate virulence factors in P. falciparum for functional validation.
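The resampling step described in the Methods, oversampling only inside cross-validation folds, can be sketched as follows. This is a minimal standard-library illustration and not the authors' code: the function names, the toy three-feature data, the choice of label 1 as the minority (virulence) class, and the simplified SMOTE are all assumptions made for the example. A real analysis would more likely combine imbalanced-learn's `SMOTE` with scikit-learn's `StratifiedKFold`.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def smote(minority, n_new, k_neighbors=5, rng=None):
    """Simplified SMOTE (hypothetical helper): each synthetic point lies on
    the segment between a minority sample and one of its nearest
    minority neighbours. Requires at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def resampled_training_folds(X, y, k=5, seed=0):
    """Yield (X_train, y_train, test_idx) per fold. Oversampling touches the
    training part only, so held-out samples never influence synthetic points."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, seed)
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(X_tr, y_tr) if lab == 1]
        # Number of synthetic points needed to balance the two classes.
        n_new = (len(y_tr) - len(minority)) - len(minority)
        new_pts = smote(minority, n_new, rng=rng) if n_new > 0 else []
        yield X_tr + new_pts, y_tr + [1] * len(new_pts), test_idx


# Toy data (assumed): 90 majority and 10 minority samples with 3 features,
# standing in for the three differentiation scores.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(90)]
X += [[data_rng.gauss(2, 1) for _ in range(3)] for _ in range(10)]
y = [0] * 90 + [1] * 10

for X_tr, y_tr, test_idx in resampled_training_folds(X, y):
    print(len(y_tr), sum(y_tr), len(test_idx))
```

Generating synthetic samples before splitting would let information from held-out genes leak into training and inflate the reported metrics, which is why the oversampling sits inside the fold loop. Each test fold keeps the original class imbalance, so a metric such as AUC-PR is evaluated on realistic class proportions.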
License
Copyright (c) 2025 Hussein KHT

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
