Recently, I started an NLP project for my <Text Analytics> class. Following our professor’s recommendation to choose an ongoing or recently finished competition, our team decided to take on the DRAGON Challenge. Over the next few months, I’ll be sharing a series of posts detailing our project experience.
(FYI: I used the List of Data Science Competition Platforms when searching for a project.)
Overview
- The DRAGON Challenge (Diagnostic Report Analysis: General Optimization of NLP) involves developing NLP algorithms for automated medical data curation.
- 🏥 Data Scope: 28,824 annotated medical diagnosis reports from 22,895 patients, collected from five Dutch care centers.
- 📋 Tasks: The challenge comprises 28 clinically relevant tasks.
- 🤖 Pre-trained Models: The challenge provides models pre-trained on more than 4 million clinical reports. All models are available on HuggingFace.
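Since the pre-trained models are hosted on HuggingFace, loading one could look roughly like the sketch below. It assumes the `transformers` library (with PyTorch) is installed. The function names `load_pretrained` and `embed` are my own, and the repository ID is left as a parameter because the exact model names should be taken from the challenge's HuggingFace page rather than guessed.

```python
def load_pretrained(repo_id: str):
    """Load a tokenizer and encoder from the HuggingFace Hub.

    `repo_id` is supplied by the caller; look up the exact repository
    name on the challenge's HuggingFace page.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id)
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return the last-layer hidden states for a single report."""
    # truncation=True cuts the report off at the model's maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state


# Usage (hypothetical repo ID, replace with a real one from the model hub):
# tokenizer, model = load_pretrained("<owner>/<dragon-model-name>")
# hidden = embed("Voorbeeld van een verslag.", tokenizer, model)
```

This only loads the encoder. A task-specific head (classification, regression, NER) would still have to be added and fine-tuned on top.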
Data
- Clinical Reports: A total of 28,824 reports from 22,895 patients were included, gathered from five Dutch care centers.
- Patient Visits: The data covers patients with diagnostic or interventional visits between January 1, 1995, and February 12, 2024.
- Sample Reports: You can view sample reports on GitHub.
- Annotation:
  - For 27 of the 28 tasks, all reports were manually annotated.
  - For task 18, the 4,803 development cases were automatically annotated using GPT-4, while the 172 test cases were manually annotated.
- Data Access: Due to privacy restrictions, participants cannot directly access or download the medical report data. It is only available through the Grand Challenge (GC) platform for model training and testing.
Pre-trained models
- Models Available: BERT-base (Dutch), RoBERTa-base/large (Multilingual), Longformer-base/large (English)
- Training data:
  - Source: Medical reports from Ziekenhuisgroep Twente hospital.
  - Period: July 13, 2000 to April 25, 2023
  - Size: 4,152,762 reports
  - Split: training (80%), validation (10%), test (10%)
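To make the 80/10/10 split concrete, here is a minimal sketch of how such a split could be produced. The challenge description does not say how the organizers actually partitioned the reports (randomly, by patient, or by date), so the random shuffle, the fixed seed, and the function name `split_ids` are all assumptions for illustration.

```python
import random


def split_ids(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle report IDs and cut them into train/validation/test lists.

    Assumption: a plain random split. The real split may have been
    done differently (for example, per patient).
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible

    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])

    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder, so no ID is dropped
    return train, val, test


# Small demonstration with 100 made-up report IDs.
train, val, test = split_ids(range(100))
print(len(train), len(val), len(test))
```

Giving the test set whatever is left over guarantees that every report lands in exactly one subset, even when the total is not evenly divisible.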