This paper examines how the performance of AI-based medical decision tools is measured, showing that common metrics can be misleading and that clinicians need multiple complementary measures to properly evaluate these systems.
Scientific paper - On Evaluation Metrics for Medical Applications of Artificial Intelligence
Year of publication: 2022
Artificial intelligence (AI) and machine learning (ML) systems are increasingly being tested in healthcare settings to support clinical decision-making, e.g., to detect polyps during colonoscopies or to classify diseases from medical images. Before these tools can be trusted in real-world practice, their performance must be rigorously assessed. The paper highlights a widespread and underappreciated problem: the metrics used to report how well these AI systems perform are frequently incomplete, misleading, or misunderstood, even by the researchers who develop them.
The paper focuses on binary classification tasks: situations where an AI model must decide between two possible outcomes (e.g., a tissue sample is either cancerous or not). By analyzing five published gastroenterology studies, the authors demonstrate five key pitfalls in how performance is typically reported:
- Precision vs. Recall: A model can appear to have near-perfect performance by correctly detecting almost all positive cases (high recall / sensitivity), while a large proportion of its positive predictions are in fact false alarms (low precision). These two measures must be reported together, because the appropriate balance between them depends on the clinical stakes: missing a cancer diagnosis has different consequences from generating unnecessary follow-up procedures. The first code sketch after this list shows how the two metrics can diverge.
- Class imbalance effects: When one outcome is much rarer than the other (e.g., disease is rare in the general population), a model that always predicts the majority class can still achieve high accuracy while being completely useless clinically. Standard metrics can mask this failure when the distribution of cases in training and testing sets differs. The second sketch after this list demonstrates this failure mode.
- Selective reporting on filtered data: Some studies remove the cases on which their model performs poorly before calculating metrics. This inflates apparent performance and provides a misleading picture of how the tool would behave in a real clinical environment, where such filtering is not possible.
- One-sided class evaluation: Reporting performance only for cases containing disease, without equally examining performance on healthy cases, can conceal critical weaknesses. A model that appears excellent at detecting illness may simultaneously perform very poorly at correctly clearing healthy patients.
- Failure to report comprehensive metrics: Most studies report only a handful of metrics (commonly accuracy and sensitivity). The paper argues that all four fundamental counts should be published: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It also calls for a broader panel of measures, including the Matthews correlation coefficient (MCC), which gives a more balanced picture of performance even when classes are unequal in size.
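To make the precision-versus-recall pitfall concrete, here is a minimal sketch in Python. The confusion-matrix counts are invented for illustration; they do not come from the paper or the studies it analyzes:

```python
# Hypothetical counts for a polyp detector evaluated on 10,000 frames,
# 1,000 of which truly contain a polyp. All numbers are invented.
TP, FN = 990, 10        # finds 990 of the 1,000 true polyps
FP, TN = 2_000, 7_000   # but raises 2,000 false alarms on healthy frames

recall = TP / (TP + FN)      # sensitivity: share of true polyps detected
precision = TP / (TP + FP)   # share of positive calls that are correct

print(f"recall    = {recall:.3f}")     # 0.990 -> looks near-perfect
print(f"precision = {precision:.3f}")  # 0.331 -> two of three alarms are false
```

Reported alone, the 99% recall suggests an excellent detector; only the precision reveals that most of its alarms would trigger unnecessary follow-up.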
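A second sketch, again with invented numbers, shows how class imbalance lets a clinically useless model score high accuracy, and how per-class recall and the MCC expose it. It assumes NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, matthews_corrcoef

# Hypothetical screening population with ~1% disease prevalence (invented).
rng = np.random.default_rng(seed=0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # 1 = diseased, 0 = healthy

# A degenerate "model" that always predicts the healthy majority class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # ~0.99: looks excellent
print(recall_score(y_true, y_pred))               # 0.0: misses every patient
print(recall_score(y_true, y_pred, pos_label=0))  # 1.0: specificity is perfect

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN));
# scikit-learn returns 0 for this degenerate case: no predictive value.
print(matthews_corrcoef(y_true, y_pred))          # 0.0
```

Accuracy alone rewards the majority-class guess; examining both classes and the MCC makes the failure impossible to miss.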
The authors released MediMetrics, a free, open-source, web-based tool that allows researchers and clinicians to calculate and compare all relevant metrics from published studies, enabling reproducibility and fair cross-study comparison.
The broader implication of this work is methodological and normative: the choice of performance metric is not a neutral technical decision but a value-laden one. What counts as an acceptable rate of missed diagnoses, or of false alarms, depends on clinical priorities, patient safety standards, and institutional requirements. The paper argues strongly for transparent reporting, strict separation of training and testing data, and independent evaluation across different patient populations and hospital settings.
Authors of the paper: Steven A. Hicks, Inga Strümke, Vajira Thambawita, Malek Hammou, Michael A. Riegler, Pål Halvorsen & Sravanthi Parasa
Publisher or journal of publication: Scientific Reports (Nature Portfolio)
The paper is available at the following link.