
Scientific paper - The limits of fair medical imaging AI in real-world generalization
Year 2024

This paper examines how AI systems for medical imaging can indirectly learn patients' demographic attributes (sex, age, race) and use them as “shortcuts” when making diagnoses. These shortcuts degrade diagnostic quality for certain groups, and fairness metrics behave unpredictably when a model is deployed in a hospital whose patient population differs from the one it was trained on. Methods that correct these biases by removing demographic information have been proposed, but there is no definitive solution: this information remains essential in many clinical contexts, and the medical team must assess its role case by case.

What are demographic ‘shortcuts’?

Recent studies have shown that image-based AI models can consistently learn patients' demographic information (e.g., age, sex, race) during training, even though they were not designed to do so. The model then uses this information as a shortcut: it exploits a correlation between the demographic attribute and the disease, even where the attribute has no direct relationship to the disease, which can affect the accuracy of its predictions across subgroups.

Analyses conducted in ophthalmology, dermatology, and radiology across various countries confirm that the models that most heavily encode demographic information are the most biased, and identify the problem as a structural consequence of training on real medical data.

Is it possible to correct this bias in the hospital?

Debiasing and robustness methods applied during training (ReSample, GroupDRO, DANN, CDANN, MA) can help mitigate these biases in the hospitals where the models were trained. So can the Pareto front concept, which identifies models that, while not perfect, strike a good balance between accuracy and fairness. However, efforts to optimize fairness must always be accompanied by an evaluation of other clinically relevant metrics.
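As a rough illustration of the Pareto front idea, the sketch below keeps only the candidate models that no other candidate beats on both accuracy and fairness at once. The `Candidate` type, the model names, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    auroc: float         # diagnostic performance, higher is better
    fairness_gap: float  # gap between subgroups, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.auroc >= b.auroc and a.fairness_gap <= b.fairness_gap
    strictly_better = a.auroc > b.auroc or a.fairness_gap < b.fairness_gap
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("model_a", 0.86, 0.12),
    Candidate("model_b", 0.84, 0.05),
    Candidate("model_c", 0.83, 0.09),  # worse than model_b on both axes
    Candidate("model_d", 0.80, 0.03),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Choosing among the models left on the front remains a clinical judgment about how much accuracy to trade for a smaller gap.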

Could this model also work at another hospital in my city? Or in another country?

The model's overall diagnostic performance is likely to remain consistent across hospitals, but its fairness is unpredictable because it depends on context: different disease prevalence, new patient demographics, or different image acquisition and annotation methods (the differential impact of distribution shift).
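To make the contrast concrete, here is a minimal sketch that assumes the fairness gap is measured as the difference in false-negative rate between subgroups, which is one of several possible choices. The two hospitals, the group labels, and the predictions are invented toy data.

```python
from collections import defaultdict


def false_negative_rate_by_group(y_true, y_pred, groups):
    """Per group: fraction of truly positive patients predicted negative."""
    positives = defaultdict(int)
    misses = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 0:
                misses[group] += 1
    return {g: misses[g] / positives[g] for g in positives}


def fairness_gap(rates):
    """Largest difference in the chosen rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): same labels and groups at both hospitals.
groups = ["g1"] * 5 + ["g2"] * 5
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
pred_hospital_a = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]  # one miss in each group
pred_hospital_b = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # both misses fall in g2

gap_a = fairness_gap(false_negative_rate_by_group(y_true, pred_hospital_a, groups))
gap_b = fairness_gap(false_negative_rate_by_group(y_true, pred_hospital_b, groups))
print(gap_a, gap_b)
```

In both toy hospitals the model misses two of eight positive cases, so overall sensitivity is identical, yet the gap between groups differs.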

Surprisingly, the most effective criterion identified for selecting a model that maintains its fairness when transferred to another hospital is not to choose the one with the smallest fairness gap in training, but rather the one that encodes the least demographic information internally.
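The two selection criteria can be compared in a few lines. This sketch assumes that demographic encoding is summarized by a single number, here the AUROC of a hypothetical probe that tries to recover a demographic attribute from the model's features (0.5 would be chance level). The field names and values are illustrative assumptions, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainedModel:
    name: str
    in_distribution_gap: float    # fairness gap at the training hospital
    attribute_probe_auroc: float  # assumed proxy for demographic encoding


def pick_by_smallest_gap(models):
    """Criterion 1: smallest fairness gap measured where the model was trained."""
    return min(models, key=lambda m: m.in_distribution_gap)


def pick_by_least_encoding(models):
    """Criterion 2: least demographic information recoverable from the model."""
    return min(models, key=lambda m: m.attribute_probe_auroc)


# Hypothetical candidates, for illustration only.
models = [
    TrainedModel("model_a", in_distribution_gap=0.02, attribute_probe_auroc=0.91),
    TrainedModel("model_b", in_distribution_gap=0.04, attribute_probe_auroc=0.68),
    TrainedModel("model_c", in_distribution_gap=0.07, attribute_probe_auroc=0.75),
]

print(pick_by_smallest_gap(models).name)
print(pick_by_least_encoding(models).name)
```

The two criteria can select different models, which is why the choice of criterion matters before a model is transferred.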

Final Implications

Given the circumstances, it is essential to correct these biases in the hospitals where these models were trained. However, questions arise regarding their long-term usefulness if the metrics are not maintained consistently across hospitals.

Furthermore, demographic variables cannot all be excluded: where a variable is directly related to the clinical presentation of the disease, removing it could lead to less accurate diagnoses for certain groups. Whether to include such variables in a model, and whether to treat them as direct causes, is a decision that must always be made by the clinical team, not the algorithm.

When a model fails to ensure equity in a new hospital, the problem may stem from two sources: either the model was already unfair in the hospital where it was trained, or the demographic characteristics or disease prevalence in the new hospital affect some groups more than others. Identifying the cause is necessary before taking appropriate action.

The FDA does not require external validation of medical AI systems; it is therefore recommended that performance be continuously monitored across different patient demographic groups, given that the patient population may change over time.

Finally, we must not forget that there are many definitions of fairness, some of which are even incompatible with one another. For this reason, each centre should explicitly choose which definition best suits the needs of its clinical context.
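A small example shows how two common definitions can disagree. The sketch below compares demographic parity (equal rate of positive predictions across groups) with equal opportunity (equal true-positive rate across groups). These two definitions are chosen here for illustration, and the patients are toy data.

```python
def positive_rate(y_pred):
    """Share of patients predicted positive."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true, y_pred):
    """Share of truly positive patients predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives)


# Toy data (hypothetical): disease prevalence differs between the groups,
# and the classifier happens to be perfect on both.
y_true_g1 = [1, 1, 0, 0]
y_true_g2 = [1, 0, 0, 0]
y_pred_g1 = list(y_true_g1)
y_pred_g2 = list(y_true_g2)

demographic_parity_gap = abs(positive_rate(y_pred_g1) - positive_rate(y_pred_g2))
equal_opportunity_gap = abs(
    true_positive_rate(y_true_g1, y_pred_g1)
    - true_positive_rate(y_true_g2, y_pred_g2)
)
print(demographic_parity_gap, equal_opportunity_gap)
```

Even a classifier with no errors satisfies equal opportunity here while violating demographic parity, simply because the disease is more common in one group.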

Authors of the paper: Yuzhe Yang, Haoran Zhang, Judy W. Gichoya, Dina Katabi and Marzyeh Ghassemi

Publisher or journal of publication: Nature Medicine

The paper is available at the following link.

María Morales Martínez, BSC
Published: Friday, 28 June 2024 - Last modified: Tuesday, 12 May 2026