When Genes Meet Algorithms: What Machine Learning Really Adds to Predicting MS and Alzheimer’s Risk
Complex disorders such as multiple sclerosis (MS) and Alzheimer’s disease (AD) are quintessentially polygenic: risk is distributed across many variants of small effect, embedded within correlated genomic architectures shaped by linkage disequilibrium (LD). The article by Arnal Segura and colleagues (2025, International Journal of Molecular Sciences) addresses a practical and timely question in statistical genomics: when the predictors are high-dimensional, partially redundant, and only indirectly linked to biology, which machine learning (ML) paradigms actually deliver robust disease classification from genotype data—and under what constraints do they fail?
Cohorts, Case–Control Construction, and Predictor Curation
The primary analyses leverage UK Biobank (UKB) genotypes, with external validation performed on MS data from the International Multiple Sclerosis Genetics Consortium (IMSGC; dbGaP phs000139.v1.p1) and AD data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Rather than using genome-wide SNP arrays indiscriminately, the study restricts predictors to variants curated in ClinVar (with at least one review-status level) or DisGeNET disease-associated variants, and includes sex as a binary covariate. When an HLA gene is implicated, imputed HLA types (UKB Field 22182) are incorporated, reflecting the known immunogenetic burden in MS and enabling downstream interpretability centered on the major histocompatibility complex.
Quality Control, Encoding Strategy, and Genotype Imputation
Rigorous preprocessing is applied using PLINK-based filtering (Hardy–Weinberg equilibrium p-value threshold, minor allele frequency threshold, missingness limits per marker and per sample), followed by LD computation to explicitly quantify predictor correlations. Genotypes are encoded under an additive scheme with an explicit missing-value code (0 = missing; 1 = variant allele absent; 2 = heterozygous; 3 = homozygous for the variant allele), and monomorphic predictors are removed. Missing genotypes are imputed only for array variants already present but partially missing, using SHAPEIT4 for phasing and IMPUTE5 for imputation with 1000 Genomes Phase 3 reference panels; low-probability imputations are treated as missing, and low-quality imputed variants are excluded. This pipeline matters because classification performance in genomic ML can be dominated by subtle QC artifacts if missingness and imputation uncertainty are not explicitly controlled.
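The encoding and monomorphic-filter steps can be sketched in a few lines. This is a minimal illustration, not the authors' code: the matrix, function names, and use of NaN for missing calls are all assumptions; only the 0–3 coding scheme comes from the article.

```python
import numpy as np

# Additive encoding described above: 0 = missing, 1 = variant allele absent,
# 2 = heterozygous, 3 = homozygous for the variant allele.
# `dosages` holds alt-allele counts (0/1/2) with np.nan marking missing calls;
# the matrix and helper names are illustrative, not from the paper's pipeline.
def encode_genotypes(dosages: np.ndarray) -> np.ndarray:
    codes = np.where(np.isnan(dosages), 0, dosages + 1)
    return codes.astype(int)

def drop_monomorphic(codes: np.ndarray) -> np.ndarray:
    # Remove predictors with no variation among non-missing genotypes.
    keep = []
    for j in range(codes.shape[1]):
        observed = codes[:, j][codes[:, j] != 0]
        keep.append(observed.size > 0 and np.unique(observed).size > 1)
    return codes[:, np.array(keep)]

dosages = np.array([[0, 1, 2],
                    [np.nan, 1, 2],
                    [1, 1, 0]], dtype=float)
codes = encode_genotypes(dosages)   # -> [[1, 2, 3], [0, 2, 3], [2, 2, 1]]
filtered = drop_monomorphic(codes)  # middle column is monomorphic, so it is dropped
```

Keeping "missing" as its own code (rather than imputing a mean dosage) lets tree models treat missingness as a potentially informative category.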
Model Families and Evaluation Protocol: Stability as a First-Class Outcome
The authors benchmark logistic regression (LR), three ensemble tree methods—Gradient-Boosted Decision Trees (GB), Random Forest (RF), Extremely Randomized Trees (ET)—and two deep learning (DL) architectures: feedforward neural networks (FFN) and convolutional neural networks (CNN). A nested cross-validation design (10-fold inner loop for hyperparameter selection; 5-fold outer loop for generalization estimates) is used, with undersampling strategies to balance cases and controls and metrics chosen to remain meaningful under class imbalance (balanced accuracy, sensitivity, specificity). Critically, the article treats variance across folds (standard deviation of performance metrics) as an empirical proxy for stability—an aspect often under-emphasized in genomic ML, yet central for clinical translation and reproducibility.
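The nested design above (10-fold inner loop for hyperparameters, 5-fold outer loop for generalization, fold-to-fold SD as a stability proxy) can be sketched with scikit-learn. The synthetic data, the logistic-regression grid, and all settings here are assumptions for illustration; only the fold structure and the balanced-accuracy metric follow the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the curated genotype matrix.
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)

inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},   # illustrative grid, not the paper's
    cv=10,                                # inner loop: hyperparameter selection
    scoring="balanced_accuracy",
)
scores = cross_val_score(inner, X, y,
                         cv=5,            # outer loop: generalization estimate
                         scoring="balanced_accuracy")

# Mean performance plus fold-to-fold SD as the stability proxy.
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because hyperparameters are re-selected inside every outer fold, the outer scores are unbiased generalization estimates, and their standard deviation is exactly the stability quantity the authors emphasize.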
Core Performance Findings in MS and AD
Across UKB-derived evaluations, LR emerges as consistently stable and competitive, delivering the highest mean balanced accuracy for MS (≈0.635) with the lowest variability among methods (SD ≈0.005), while DL approaches display larger fold-to-fold variability despite similar mean performance (FFN ≈0.629; CNN ≈0.619, with substantially higher SDs). For AD, RF and ET are among the top performers (RF ≈0.681; ET ≈0.675), with LR closely trailing (≈0.674) and again showing comparatively low variance. External validation on IMSGC (MS) and ADNI (AD) shows no marked collapse in sensitivity or balanced accuracy, supporting generalization beyond the discovery cohort and reducing concerns that the models are merely capturing cohort-specific confounding. A key interpretation is that in curated-variant feature spaces with LD-induced correlation, "simpler" linear decision boundaries can be more reliable than parameter-rich DL models that are prone to instability under modest sample perturbations.
Benchmarking Against Polygenic Risk Scores and the Meaning of “Prediction”
To compare ML classifiers with polygenic risk score (PRS) modeling, the study aligns fold partitions and converts both ML probability outputs and PRS values into “predicted positive” classes using an extreme-risk cutoff (upper 99th percentile), then reports relative risk (RR) and odds ratios (OR). PRS tends to perform at a broadly “average” level relative to ML in this framework: it can enrich for cases at the extreme tail (e.g., MS RR on the order of ~4; AD RR on the order of ~5–6), but it does not consistently dominate the better-performing ML methods. Notably, LR shows particularly strong enrichment in MS at the 99th percentile (RR > 5 in the reported summary), illustrating that a well-regularized linear model operating on curated disease variants can behave like a discriminative analogue of polygenic scoring—yet with classification-calibrated outputs and explicit optimization for balanced accuracy rather than purely additive risk aggregation.
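The extreme-tail comparison works the same way for a PRS or an ML probability: threshold the score at the 99th percentile, then contrast case rates above and below the cutoff. The sketch below uses synthetic scores and labels purely to make the RR/OR arithmetic concrete; nothing here reproduces the study's data.

```python
import numpy as np

# Illustrative relative-risk calculation at the 99th-percentile cutoff.
# Scores and labels are synthetic: cases are shifted upward on average.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)           # 1 = case, 0 = control
scores = labels * 0.5 + rng.normal(size=10_000)    # risk score (PRS or ML output)

cutoff = np.percentile(scores, 99)
high = scores >= cutoff

risk_high = labels[high].mean()    # case rate in the extreme-risk tail
risk_rest = labels[~high].mean()   # case rate in everyone else
rr = risk_high / risk_rest         # relative risk

odds = lambda p: p / (1 - p)
or_ = odds(risk_high) / odds(risk_rest)  # odds ratio
```

Note that for an enriched tail the OR is always larger than the RR, so the two should not be compared across studies interchangeably.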
Explainability, Biological Coherence, and the Role of LD
The interpretability analyses (feature coefficients for LR; impurity-based importances for tree ensembles; layer-integrated gradients for FFN/CNN) converge on a biologically coherent picture for MS: prioritized variants concentrate in non-coding regulatory space, frequently annotated as expression or splicing quantitative trait loci (eQTL/sQTL) in GTEx, and are enriched for immune-related loci, prominently including HLA annotations and imputed HLA types (e.g., signals consistent with established MS immunogenetic architecture). Among highlighted coding signals, missense variants such as rs6897932 in IL7R and rs763361 in CD226 align with T-cell development and immune regulation, reinforcing the disease’s immunological basis. Importantly, the authors emphasize that LD creates clusters of correlated predictors; the encouraging result is that the tested ML methods can maintain performance even with correlated features, but the interpretability layer becomes essential to avoid over-interpreting any single variant as uniquely causal. In aggregate, the article argues for a pragmatic genomics ML stance: prioritize stability, validated generalization, and biologically anchored explanation over architectural complexity, especially when moving toward clinically meaningful genomic predisposition models.
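The two simplest interpretability layers named above, LR coefficient magnitudes and tree-ensemble impurity importances, can be contrasted directly. This sketch uses synthetic data with the informative features placed first; model settings are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# shuffle=False puts the informative columns first, so we know ground truth.
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           shuffle=False, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

lr_rank = np.argsort(-np.abs(lr.coef_[0]))       # rank by |coefficient|
rf_rank = np.argsort(-rf.feature_importances_)   # rank by impurity importance

# Under LD-like correlation, importance spreads across correlated predictors,
# so rankings should be read at the locus level, not per variant.
top_overlap = set(lr_rank[:5]) & set(rf_rank[:5])
```

Convergence across model families (here, shared top-ranked features) is the kind of cross-method agreement the authors use to argue biological coherence, while the LD caveat in the comment is why no single top-ranked variant should be read as uniquely causal.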
Disclaimer: This blog post is based on the provided research article and is intended for informational purposes only. It is not intended to provide medical advice. Please consult with a healthcare professional for any health concerns.
References:
Arnal Segura, M., Bini, G., Krithara, A., Paliouras, G., & Tartaglia, G. G. (2025). Machine learning methods for classifying multiple sclerosis and Alzheimer’s disease using genomic data. International Journal of Molecular Sciences, 26(5), 2085.
