The Elusive Reliability of CKD Risk Models
Chronic kidney disease prediction models excel internally but falter externally, raising questions about their deployment readiness. Calibration and uncertainty need more focus.
Chronic kidney disease (CKD) poses a significant health concern, and machine learning models promise to predict its risk effectively. Yet, the reliability of these models remains in question when subjected to varying conditions. A new study trained five classifiers using the UCI CKD dataset, which consisted of 400 patients with a prevalence rate of 62.5% for CKD.
The Internal-External Disparity
The classifiers, which included logistic regression, random forest, and XGBoost, were tested for calibration quality and deployment readiness. Internally, they achieved an area under the receiver operating characteristic curve (AUROC) of 1.00, which on paper is perfect discrimination. However, this isn't the full story.
A distributional stress test that applied these models to the MIMIC-IV demo cohort, consisting of 97 patients with a lower CKD prevalence of 23.7%, revealed a stark drop in performance. The AUROC fell to between 0.48 and 0.58, close to the 0.5 expected from random guessing. If these models can't maintain performance under different conditions, what good is their supposed accuracy?
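To make the AUROC figures concrete: the metric can be read as the probability that a randomly chosen CKD patient receives a higher risk score than a randomly chosen non-CKD patient, with ties counted as half. The sketch below computes it that way in plain Python. The function name `auroc` and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def auroc(labels, scores):
    """AUROC as a pairwise ranking statistic.

    Returns the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): scores that separate the classes cleanly,
# and scores that carry no ranking information at all.
separable = auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = auroc([1, 1, 0, 0], [0.4, 0.6, 0.5, 0.5])
print(separable, uninformative)
```

A value near 0.5 means the ranking of patients is about as good as a coin flip, which is why the external range reported above is so troubling.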
The Calibration Conundrum
Calibration, often overshadowed by model accuracy, surfaced as a critical weakness. Isotonic recalibration reduced internal Expected Calibration Error (ECE) to as low as 0.000, but on the MIMIC-IV demo cohort ECE surged to between 0.68 and 0.76. In other words, predicted probabilities that matched observed CKD rates almost exactly on internal data were off by roughly 70 percentage points on average externally. Internal robustness doesn't equate to external reliability.
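ECE is usually computed by grouping predictions into probability bins and averaging, weighted by bin size, the gap between the mean predicted probability and the observed event rate in each bin. The sketch below is a minimal version under stated assumptions: it uses the positive-class probability, ten equal-width bins, and a hypothetical function name `expected_calibration_error`. The article does not say which binning scheme the study used, so treat this as one common variant rather than the authors' method.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Binned ECE for a binary classifier.

    labels: 0/1 outcomes; probs: predicted probability of the positive class.
    Uses equal-width bins on [0, 1] (an assumption of this sketch).
    """
    n = len(labels)
    if n == 0:
        raise ValueError("no predictions given")

    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


# Toy illustration (assumed data): a model that predicts 0.95 for everyone,
# applied to a group in which only one patient in four has the condition.
print(expected_calibration_error([1, 0, 0, 0], [0.95, 0.95, 0.95, 0.95]))
```

The toy case shows one way a very large ECE can arise: confident predictions tuned to a high-prevalence population meet a population where the condition is much rarer. Whether that mechanism explains the study's numbers is not something the article establishes.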
Conformal prediction coverage, the share of cases in which a model's prediction set contains the true label, also took a hit. Internally, it ranged from 0.80 to 0.98 against a 90% target. On external data, coverage plummeted to 0.21-0.25. This raises a pressing question: are we ready to trust these models in clinical settings when their uncertainty estimates falter so significantly?
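For readers unfamiliar with the coverage metric, here is a minimal sketch of split conformal prediction for a binary classifier. It assumes the common nonconformity score of one minus the probability assigned to the true class; the article does not specify the study's conformal procedure, and every name and number below is hypothetical.

```python
import math


def conformal_threshold(cal_labels, cal_probs, alpha=0.1):
    """Score threshold from a held-out calibration set.

    Nonconformity score = 1 - probability assigned to the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p) for y, p in zip(cal_labels, cal_probs)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every label gets included
    return scores[k - 1]


def prediction_set(p, qhat):
    """Labels whose nonconformity score does not exceed the threshold."""
    return {label for label, prob in ((1, p), (0, 1.0 - p)) if 1.0 - prob <= qhat}


def empirical_coverage(labels, probs, qhat):
    """Fraction of cases whose true label falls inside the prediction set."""
    hits = sum(1 for y, p in zip(labels, probs) if y in prediction_set(p, qhat))
    return hits / len(labels)


# Toy calibration data (assumed): a confident model that is usually right.
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
cal_probs = [0.9, 0.95, 0.85, 0.9, 0.8, 0.1, 0.05, 0.15, 0.2, 0.1]
qhat = conformal_threshold(cal_labels, cal_probs, alpha=0.1)

# Similar data versus shifted data where confident positives are mostly wrong.
similar = empirical_coverage([1, 0, 1, 0], [0.9, 0.1, 0.85, 0.15], qhat)
shifted = empirical_coverage([0, 0, 0, 1], [0.9, 0.95, 0.85, 0.9], qhat)
print(similar, shifted)
```

The coverage guarantee of conformal prediction depends on calibration and test data being exchangeable. When the test population differs from the calibration population, as in the shifted toy case, the guarantee no longer applies and coverage can fall far below the target.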
Rethinking Deployment Readiness
The study's deployment readiness checklist further underscores this gap. No model scored above 4 out of 16, a disconcerting revelation for those banking on AI solutions in healthcare. Before these models can be deployed, calibration stability and conformal coverage must be prioritized, because strong numbers on paper do not guarantee reliable behavior at the bedside.
In essence, until these models demonstrate reliable performance on diverse external data, they remain unfit for clinical deployment. We must scrutinize calibration and uncertainty with the same vigor we apply to accuracy. The convergence of AI and healthcare promises much, but without reliable ground truths, it's a promise left unfulfilled.