Webinar
Datasets through the L👀king-Glass
Datasets through the L👀king-Glass is a webinar series focusing on the data aspects of learning-based methods. Our aim is to build a community of scientists interested in understanding how the data we use affects algorithms and society as a whole, instead of only optimizing for a performance metric. We draw inspiration from a variety of topics, such as data curation for building datasets, meta-data, shortcuts, fairness, ethics and philosophy in AI.
All previous talks that the speakers have agreed to share can be found in our YouTube playlist.
Next webinar: Language and image data
Date: 20 October 2025 at 9:30am CET
Where: Zoom: Register here
Speakers:
- Yuki Arase - Tokyo Institute of Technology, Japan
Title: Japanese Medical Text Simplification Using Patient Blogs
Abstract: Text simplification aims to automatically rewrite complex sentences into simpler and more accessible forms. In the medical domain, such simplification is highly desired to enhance patient understanding, yet it remains underexplored in Japanese due to the scarcity of linguistic resources. In this study, we construct a parallel corpus for evaluating Japanese medical text simplification, using data collected from patient weblogs. The corpus consists of 1,425 pairs of complex and simplified sentences, with and without medical terminology. To further improve readability, we introduce a lexically constrained reranking method that suppresses the output of technical terms. Experimental results show that our approach enhances simplification performance in the medical domain, demonstrating its potential for patient-centered healthcare communication.
- Mamunur Rahaman - University of New South Wales, Australia
Title: Advancing Computational Pathology: Multimodal Datasets and Deep Learning Insights
Abstract: The histopathological assessment of tissue biopsies remains the gold standard for cancer diagnosis, but it is limited by subjectivity and challenges in capturing molecular heterogeneity. This talk explores a suite of AI frameworks that integrate deep learning with multimodal datasets, including histopathology images, spatial transcriptomics, and clinical trial data from sources like RTOG 0521 and CHAARTED, to improve cancer diagnostics, prognostication, and treatment personalization. Key innovations include HistopathAI for robust classification under class imbalance, an AI biomarker for Head and Neck Squamous Cell Carcinoma, BrST-Net for predicting gene expression from H&E slides in breast cancer, ST-DoxPCa for stratifying docetaxel response in prostate cancer, and MR-PHE for zero-shot learning in rare disease histopathology. By leveraging diverse datasets such as whole-slide images and genomic profiles, these methods bridge histology and genomics, offering scalable, clinically actionable insights that advance precision oncology.
- David Restrepo - CentraleSupélec, France
Title: Opening Eyes: Advancing Equitable AI Through Open Ophthalmology Data
Abstract: Artificial intelligence has shown great promise in ophthalmology, yet concerns around fairness, bias, and generalizability persist—particularly for populations underrepresented in clinical research. In this talk, I present our efforts to address these challenges through the development of three open and representative datasets: BRSET, mBRSET, and Multi-OphthaLingua. These resources capture diverse retinal images and associated demographic and clinical data from Latin America and beyond, enabling systematic benchmarking of AI performance across geographic, socioeconomic, and linguistic dimensions. Together, they provide a foundation for measuring algorithmic bias and for building equitable AI systems in ophthalmology, spanning both image-based diagnostics and multilingual question-answering.
Previous talks:
All previous abstracts can be found here.
- S01E01 - Dr. Roxana Daneshjou (Stanford University School of Medicine, Stanford, CA, USA). 27th Feb 2023. Challenges with equipoise and fairness in AI/ML datasets in dermatology
- S01E02 - Dr. David Wen (Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, UK). 27th Feb 2023. Characteristics of open access skin cancer image datasets: implications for equitable digital health
- S01E03 - Prof. Colin Fleming (Ninewells Hospital, Dundee, UK). 27th Feb 2023. Characteristics of skin lesions datasets
- S02E01 - Prof. Amber Simpson (Queen’s University, Canada). 5th June 2023. The medical segmentation decathlon
- S02E02 - Dr. Esther E. Bron (Erasmus MC - University Medical Center Rotterdam, the Netherlands). 5th June 2023. Image analysis and machine learning competitions in dementia
- S02E03 - Dr. Ujjwal Baid (University of Pennsylvania, USA). 5th June 2023. Brain tumor segmentation challenge 2023
- S03E01 - Dr. Thijs Kooi (Lunit, South Korea). 18th September 2023. Optimizing annotation cost for AI based medical image analysis
- S03E02 - Dr. Andre Pacheco (Federal University of Espírito Santo, Brazil). 18th September 2023. PAD-UFES-20: the challenges and opportunities in creating a skin lesion dataset
- S04E01 - Dr. Jessica Schrouff (Google DeepMind, UK). 4th December 2023. Detecting shortcut learning for fair medical AI
- S04E02 - Rhys Compton and Lily Zhang (New York University, USA). 4th December 2023. When more is less: Incorporating additional datasets can hurt performance by introducing spurious correlations
- S04E03 - Dr. Enzo Ferrante (CONICET, Argentina). 4th December 2023. Building and auditing a large-scale x-ray segmentation dataset with automatic annotations: Navigating fairness without ground-truth
- S05E01 - Hubert Dariusz Zając and Natalia-Rozalia Avlona (University of Copenhagen, Denmark). 25th March 2024. Ground Truth Or Dare: Factors Affecting The Creation Of Medical Datasets For Training AI
- S05E02 - Dr. Annika Reinke (DKFZ, Germany). 25th March 2024. Why your Dataset Matters: Choosing the Right Metrics for Biomedical Image Analysis
- S05E03 - Alceu Bissoto and Dr. Sandra Avila (UNICAMP, Brazil). 25th March 2024. The Performance of Transferability Metrics does not Translate to Medical Tasks
- S06E01 - Hava Chaptoukaev and Maria Zuluaga (EURECOM, France). 24th February 2025. Acquiring, curating and releasing a multi-modal dataset for stress detection: ambitions, achievements, mistakes and lessons learned
- S06E02 - Alice Jin (Massachusetts Institute of Technology, USA). 24th February 2025. Fair Multimodal Checklists for Interpretable Clinical Time Series Prediction
- S06E03 - Malih Alikhani and Resmi Ramachandranpillai (Northeastern University, USA). 24th February 2025. Towards Equity: Overcoming Fairness Challenges in Multimodal Learning
- S07E01 - Amelia Jiménez-Sánchez (IT University of Copenhagen, Denmark). 12th May 2025. In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
- S07E02 - Tiarna Lee (King’s College London, UK). 12th May 2025. Racial bias in cardiac imaging
- S07E03 - Dewinda J. Rumala (UNIVERSA AI, Switzerland). 12th May 2025. Seeing the Same Brain Twice: Data Leakage and Identity Bias in Brain MRI Analysis
Organizers
Amelia Jiménez-Sánchez, Théo Sourget & Veronika Cheplygina at the IT University of Copenhagen (Denmark), and Steff Groefsema at the University of Groningen (the Netherlands). This project has received funding from the Independent Research Fund Denmark - Inge Lehmann number 1134-00017B.
Newsletter
If you want to receive information about upcoming seminars, please sign up to our mailing list. We use Brevo (formerly Sendinblue), a GDPR-compliant service, as our mail provider. If you have any concerns about how we handle your data, please read our privacy notice.
Please be aware that messages from many mail providers are tagged as junk, so the confirmation email might end up in your spam folder. Please check there if you do not receive it. The sender will be PURRlab @ IT University of Copenhagen (amji @ itu.dk). Please add this sender to your contacts. If you have any problems subscribing to our mailing list, please contact Amelia.