
Outstanding achievements of a second-year EMAI student

Two papers co-authored by Pedro Moreira have been accepted for prestigious conferences: Cross-Care at NeurIPS 2024 and RABBITS at EMNLP 2024!

09.10.2024


Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
🔸 Authors: Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman
🔸 Conference: NeurIPS 2024
🔸 We’re excited to share Cross-Care, a new LLM benchmark that dives deep into safety and bias in the healthcare domain. Our findings highlight a significant misalignment between real biomedical facts and the representations in language models, shaped by their pre-training data. 
🔸 Key Takeaways:
1. Models don't know true disease prevalence—they often reflect biases from their pre-training corpora.
2. Their "understanding" instead tracks how often demographic terms and diseases co-occur in text.
3. Alignment methods might actually deepen these biases.
4. Training on language-specific datasets (usually in English) impacts understanding only in that language.
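The co-occurrence idea behind takeaway 2 can be illustrated with a minimal sketch. This is not the Cross-Care pipeline itself; the term lists and the document-level matching here are simplified assumptions for illustration (the actual benchmark uses curated vocabularies over large pre-training corpora).

```python
from collections import Counter
from itertools import product

# Hypothetical term lists; Cross-Care uses much larger curated vocabularies.
DEMOGRAPHICS = ["male", "female", "black", "white"]
DISEASES = ["diabetes", "asthma", "hypertension"]

def cooccurrence_counts(documents):
    """Count how often each (demographic, disease) pair appears
    together in the same document."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        demos_present = [d for d in DEMOGRAPHICS if d in text]
        diseases_present = [d for d in DISEASES if d in text]
        for pair in product(demos_present, diseases_present):
            counts[pair] += 1
    return counts

docs = [
    "A study of diabetes prevalence in black and white patients.",
    "Asthma rates among female adolescents.",
]
print(cooccurrence_counts(docs)[("black", "diabetes")])  # 1
```

Comparing such corpus-derived counts against real epidemiological prevalence is what exposes the misalignment the paper reports.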
🔸 Everything is open-source! Visualize disease and demographic co-occurrences and explore the downstream impacts at crosscare.net.
🔸 Read more: Cross-Care Paper 

RABBITS: Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
🔸 Authors: Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
🔸 Conference: EMNLP 2024
🔸 We took language models to the drug store… and found out they perform differently when asked about "acetaminophen" versus "Tylenol"! Robustness is crucial for medical LLM applications, as drug names can change model responses significantly. Our study with RABBITS reveals that swapping between brand and generic drug names in medical benchmarks like MedQA and MedMCQA results in notable performance drops. Even questions not directly about the drugs show changes, indicating that memorization, biases, and data contamination may be inflating model performance.
🔸 Check out the RABBITS leaderboard and explore our findings: GitHub and HuggingFace Leaderboard 
🔸 Read more: RABBITS Paper