Webinar

Datasets through the L👀king-Glass

Datasets through the L👀king-Glass is a webinar series focusing on the data aspects of learning-based methods. Our aim is to build a community of scientists interested in understanding how the data we use affects algorithms and society as a whole, instead of only optimizing for a performance metric. We draw inspiration from a variety of topics, such as data curation for building datasets, metadata, shortcuts, fairness, ethics, and philosophy in AI.

All previous talks whose speakers have agreed to share the recording can be found in our YouTube playlist.

Next webinar: Investigating medical datasets

Date: 12th May 2025 at 10am CEST

Where: Zoom, register here

Speakers:

Title: Racial bias in cardiac imaging

Abstract: Artificial intelligence (AI) methods are being used increasingly for the automated segmentation of cine cardiac magnetic resonance (CMR) imaging. However, these methods have been shown to be subject to race bias; i.e. they exhibit different levels of performance for different races depending on the (im)balance of the data used to train the AI model. We trained AI models to perform race classification on cine CMR images and/or segmentations from White and Black subjects and found that the classification accuracy for images was higher than for segmentations. Interpretability methods showed that the models were primarily looking at non-heart regions. A number of possible confounders for the bias in segmentation model performance were identified for Black subjects but none for White subjects. Distributional differences between annotated CMR data of White and Black races, which can lead to bias in trained AI segmentation models, are predominantly image-based. Most of the differences occur in areas outside the heart, such as subcutaneous fat.

Title: Seeing the Same Brain Twice: Data Leakage and Identity Bias in Brain MRI Analysis

Abstract: Deep learning models for medical imaging are often praised for their accuracy and the potential for healthcare application. However, many overlook subtle pitfalls that can undermine real-world applicability. One such issue is data leakage, which can lead to misleadingly optimistic performance. In this talk, we explore how subtle forms of data leakage, especially in longitudinal data, can lead models to exploit subject identity rather than learning meaningful clinical features. We discuss the challenges of cross-validation design in this context, and how even robust-looking 3D CNNs can pick up identity cues when repeated scans of the same subject leak into both training and validation sets. The use of GradCAM further exposes the shortcuts taken by the model. By revisiting our findings from the Alzheimer’s disease classification task, we reflect on how careful dataset splitting and evaluation strategies can improve robustness and fairness in neuroimaging AI.

Title: In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review

Abstract: Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static — they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://facct.itu.dk/.

Previous talks:

All previous abstracts can be found here.

  • S01E01 - Dr. Roxana Daneshjou (Stanford University School of Medicine, Stanford, CA, USA). 27th Feb 2023. Challenges with equipoise and fairness in AI/ML datasets in dermatology
  • S01E02 - Dr. David Wen (Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, UK). 27th Feb 2023. Characteristics of open access skin cancer image datasets: implications for equitable digital health
  • S01E03 - Prof. Colin Fleming (Ninewells Hospital, Dundee, UK). 27th Feb 2023. Characteristics of skin lesions datasets
  • S02E01 - Prof. Amber Simpson (Queen’s University, Canada). 5th June 2023. The medical segmentation decathlon
  • S02E02 - Dr. Esther E. Bron (Erasmus MC - University Medical Center Rotterdam, the Netherlands). 5th June 2023. Image analysis and machine learning competitions in dementia
  • S02E03 - Dr. Ujjwal Baid (University of Pennsylvania, USA). 5th June 2023. Brain tumor segmentation challenge 2023
  • S03E01 - Dr. Thijs Kooi (Lunit, South Korea). 18th September 2023. Optimizing annotation cost for AI based medical image analysis
  • S03E02 - Dr. Andre Pacheco (Federal University of Espírito Santo, Brazil). 18th September 2023. PAD-UFES-20: the challenges and opportunities in creating a skin lesion dataset
  • S04E01 - Dr. Jessica Schrouff (Google DeepMind, UK). 4th December 2023. Detecting shortcut learning for fair medical AI
  • S04E02 - Rhys Compton and Lily Zhang (New York University, USA). 4th December 2023. When more is less: Incorporating additional datasets can hurt performance by introducing spurious correlations
  • S04E03 - Dr. Enzo Ferrante (CONICET, Argentina). 4th December 2023. Building and auditing a large-scale x-ray segmentation dataset with automatic annotations: Navigating fairness without ground-truth
  • S05E01 - Hubert Dariusz Zając and Natalia-Rozalia Avlona (University of Copenhagen, Denmark). 25th March 2024. Ground Truth Or Dare: Factors Affecting The Creation Of Medical Datasets For Training AI
  • S05E02 - Dr. Annika Reinke (DKFZ, Germany). 25th March 2024. Why your Dataset Matters: Choosing the Right Metrics for Biomedical Image Analysis
  • S05E03 - Alceu Bissoto and Dr. Sandra Avila (UNICAMP, Brazil). 25th March 2024. The Performance of Transferability Metrics does not Translate to Medical Tasks
  • S06E01 - Hava Chaptoukaev and Maria Zuluaga (EURECOM, France). 24th February 2025. Acquiring, curating and releasing a multi-modal dataset for stress detection: ambitions, achievements, mistakes and lessons learned
  • S06E02 - Alice Jin (Massachusetts Institute of Technology, USA). 24th February 2025. Fair Multimodal Checklists for Interpretable Clinical Time Series Prediction
  • S06E03 - Malih Alikhani and Resmi Ramachandranpillai (Northeastern University, USA). 24th February 2025. Towards Equity: Overcoming Fairness Challenges in Multimodal Learning


Organizers

Amelia Jiménez-Sánchez, Théo Sourget & Veronika Cheplygina at the IT University of Copenhagen (Denmark), and Steff Groefsema at the University of Groningen (the Netherlands). This project has received funding from the Independent Research Fund Denmark - Inge Lehmann number 1134-00017B.

Newsletter

If you want to receive information about upcoming seminars, please sign up for our mailing list. We chose the GDPR-compliant Brevo (formerly Sendinblue) as our mail provider. If you have any concerns relating to our data handling, please read our privacy notice.

Please be aware that confirmation emails from mailing lists are often tagged as junk, so ours might end up in your spam folder; double-check there if it does not arrive. The sender will be PURRlab @ IT University of Copenhagen (amji @ itu.dk). Please add this sender to your contacts. If you have any problems subscribing to our mailing list, please contact Amelia.