25-1 Text analytics project – The DRAGON Challenge

Recently, I started an NLP project for my <Text Analytics> class. Following our professor’s recommendation to choose an ongoing or recently finished competition, our team decided to take on the DRAGON Challenge. Over the next few months, I’ll be sharing a series of posts detailing our project experience.

(FYI: I used the List of Data Science Competition Platforms when searching for a project.)

Overview

  • The DRAGON Challenge (Diagnostic Report Analysis: General Optimization of NLP) involves developing NLP algorithms for automated medical data curation.
  • 🏥 Data Scope: 28,824 annotated medical reports from 22,895 patients, collected from five Dutch care centers.
  • 📋 Tasks: The challenge comprises 28 clinically relevant tasks.
  • 🤖 Pre-trained Models: The challenge provides models pre-trained on roughly 4 million clinical reports. All models are available on Hugging Face.

Data

  • Clinical Reports: A total of 28,824 reports from 22,895 patients were included, gathered from five Dutch care centers.
  • Patient Visits: The data covers patients with diagnostic or interventional visits between January 1, 1995, and February 12, 2024.
  • Sample Reports: You can view sample reports on GitHub.
  • Annotation
    • For 27 of the 28 tasks, all reports were manually annotated.
    • For task 18, the 4,803 development cases were automatically annotated using GPT-4, while the 172 test cases were manually annotated.
  • Data Access: Due to privacy restrictions, participants cannot access or download the medical reports directly. The data is only available through the Grand Challenge (GC) platform for model training and testing.

Pre-trained models

  • Models Available: BERT-base (Dutch), RoBERTa-base/large (Multilingual), Longformer-base/large (English)
  • Training data:
    • Medical reports from Ziekenhuisgroep Twente hospital.
    • Duration: July 13th 2000 – April 25th 2023
    • Size: 4,152,762 reports
    • Split: training (80%), validation (10%), test (10%)
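
To make the 80/10/10 split concrete: with 4,152,762 reports, that works out to roughly 3.3 million reports for training and about 415,000 each for validation and test. Below is a minimal Python sketch of such a split. It is only an illustration: the function name `split_reports`, the fixed seed, and the random report-level shuffle are my own assumptions, since the challenge description does not say how the organizers actually divided the data (for example, whether they split by report or by patient).

```python
import random


def split_reports(report_ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle report IDs and cut them into train/validation/test lists.

    Hypothetical helper for illustration only; not the organizers' code.
    Whatever remains after the train and validation cuts goes to test.
    """
    ids = list(report_ids)
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(ids)

    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)

    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Toy example with 1,000 fake report IDs instead of the real corpus.
    train, val, test = split_reports(range(1000))
    print(len(train), len(val), len(test))  # 800 100 100
```

For real clinical data, a patient-level split (keeping all reports of one patient in the same subset) is often preferable to avoid leakage, but that is a design choice I am adding here, not something stated about this dataset.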
