Benchmarking ML models for predicting DT Cancer recurrence with nested CV and explaining predictions with SHAP (XAI).

espositomario/DTC-Recurrence-ML-SHAP

Differentiated Thyroid Cancer (DTC) Recurrence Prediction and SHAP Analysis

Authors: Irene D'Onofrio and Mario Esposito

Aim: This project benchmarks machine learning models for predicting recurrence of differentiated thyroid cancer (DTC), using nested cross-validation (nCV) for evaluation and SHAP analysis for interpretation.

Methods: The models tested were Support Vector Machine (SVM), XGBoost, Random Forest (RF), Decision Tree, Logistic Regression, and Multi-Layer Perceptron. Each was evaluated with stratified nested cross-validation: a 5-fold outer loop for performance estimation and a 3-fold inner loop for hyperparameter tuning. SHAP analysis was applied to the best-performing model (SVM) to assess feature importance and explain individual predictions.
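
The nested scheme described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's code: the data are synthetic (`make_classification` stands in for the encoded DTC table), and the pipeline, the hyperparameter grid, and the random seeds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded dataset (383 rows, 16 features).
# The class weights are arbitrary, not taken from the real data.
X, y = make_classification(n_samples=383, n_features=16,
                           weights=[0.7, 0.3], random_state=0)

mcc_scorer = make_scorer(matthews_corrcoef)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # evaluation

# Hypothetical search space; the notebook defines its own per model.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                      param_grid, scoring=mcc_scorer, cv=inner_cv)

# Each outer fold refits the whole grid search on its training part only,
# so the outer test fold never influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, scoring=mcc_scorer, cv=outer_cv)
print(f"MCC: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: tuning happens inside each outer training split rather than once on the full dataset.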

Results: SVM, XGBoost, and Random Forest (RF) generalized best, with SVM achieving the highest average Matthews correlation coefficient (MCC), 0.91 ± 0.04. SHAP analysis identified "Response" as the most influential feature and provided insight into misclassified cases.
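
For readers unfamiliar with MCC, the sketch below computes it from confusion-matrix counts and summarizes it over folds. The per-fold counts are hypothetical and are not the repository's results. Reporting the spread as a sample standard deviation is also an assumption, since the summary above does not say how the ± value was obtained.

```python
import math
from statistics import mean, stdev

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a marginal is empty and the ratio is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical (tp, tn, fp, fn) counts for five outer test folds.
fold_counts = [(20, 54, 1, 2), (19, 55, 0, 3), (21, 53, 2, 1),
               (20, 55, 0, 1), (18, 54, 1, 3)]
fold_mcc = [mcc(*counts) for counts in fold_counts]
print(f"MCC: {mean(fold_mcc):.2f} +/- {stdev(fold_mcc):.2f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than accuracy.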

DTC Dataset

UC Irvine Machine Learning Repository: DTC Dataset

  • Donated: 10/30/2023
  • Description: 13 clinicopathologic features collected over 15 years, with a minimum 10-year follow-up per patient.
  • Dataset Characteristics: Tabular
  • Primary Task: Binary Classification
  • Target label: Recurred/Not Recurred
  • Instances: 383
  • Suggested split: No
  • Features: Age, Gender, Smoking, Hx Smoking, Hx Radiotherapy, Thyroid Function, Physical Examination, Adenopathy, Pathology, Focality, Risk, T, N, M, Stage, Response
  • Reference: Springer Link
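
Most of the features listed above are categorical, so they need numeric encoding before modeling. The sketch below shows one possible approach with pandas: integer codes for an ordered feature, a dummy column for an unordered one, and a 0/1 target. The example rows, the category labels and their order, and the `Recurred` column name are assumptions for illustration; the notebook's own encoding may differ.

```python
import pandas as pd

# Hypothetical rows; labels and column names are illustrative assumptions.
df = pd.DataFrame({
    "Risk": ["Low", "High", "Intermediate"],
    "Gender": ["F", "M", "F"],
    "Recurred": ["No", "Yes", "No"],
})

# Ordinal feature: map categories to integers that respect their order.
risk_order = ["Low", "Intermediate", "High"]
df["Risk"] = pd.Categorical(df["Risk"], categories=risk_order, ordered=True).codes

# Nominal feature: one dummy column (the first level is dropped).
df = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

# Binary target: 1 for recurrence, 0 otherwise.
df["Recurred"] = (df["Recurred"] == "Yes").astype(int)
print(df)
```

Declaring the category order explicitly avoids relying on alphabetical order, which would rank "High" below "Intermediate" and "Low".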

Jupyter notebook Table of Contents

  • Exploratory Data Analysis
    • Data downloading
    • Order categories (Ordinal features)
    • Plot feature distributions
    • Plot feature distributions stratified by class (Recurred / Not Recurred)
    • Feature Encoding
  • Nested Cross-Validation (nCV)
    • Model hyperparameter spaces
    • Stratified 5-fold nCV (3-fold inner CV)
    • Save or Import existing nCV_results
    • Compare model metrics on the test folds
      • MCC, ROC AUC and PRC AUC
      • ROC and PRC curves
  • SHAP analysis on SVM
    • SHAP on testing data (loop in the outer CV)
    • Save or Import existing results
    • SHAP Visualization
      • Global feature importance
      • SHAP values per feature (sample-wise)
      • Effect of feature values on predictions (sorted by average feature importance)
    • Misclassified samples

Other Studies Using the Same Dataset

  1. MDPI Paper
  2. arXiv Paper
  3. Springer Link
