RESEARCH PAPERS

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization
Natural language processing for humanitarian action: Opportunities, challenges, and the path toward humanitarian NLP

ACCESS DATA

HumSet

HumSet is a multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. It is curated by humanitarian analysts and covers disasters around the globe that occurred from 2018 to 2021, drawn from 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry.
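Because each entry can carry one or more classes, a classifier trained on this kind of data usually needs multi-label targets rather than a single class id. The sketch below shows one common way to build such targets. The `Entry` fields, the sample texts, and the label names are illustrative assumptions, not the dataset's actual schema or taxonomy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    """One annotated snippet. Field names are hypothetical, not HumSet's schema."""
    text: str
    lang: str
    labels: List[str]


def build_label_index(entries: List[Entry]) -> Dict[str, int]:
    """Map each distinct label to a column position (sorted for determinism)."""
    distinct = sorted({label for entry in entries for label in entry.labels})
    return {label: i for i, label in enumerate(distinct)}


def multi_hot(entry: Entry, label_index: Dict[str, int]) -> List[int]:
    """Encode an entry's labels as a 0/1 vector with one slot per known label."""
    vector = [0] * len(label_index)
    for label in entry.labels:
        vector[label_index[label]] = 1
    return vector


# Made-up entries in the three dataset languages (labels are illustrative).
entries = [
    Entry("Flooding damaged two clinics.", "en", ["Health", "Shelter"]),
    Entry("Les écoles restent fermées.", "fr", ["Education"]),
    Entry("Faltan alimentos y agua potable.", "es", ["Food Security", "WASH"]),
]

label_index = build_label_index(entries)
targets = [multi_hot(entry, label_index) for entry in entries]
```

An entry with two classes simply has two positions set to 1, so the same representation covers both single-class and multi-class entries.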

ACCESS MODELS

HumBert

HumBert (Humanitarian Bert) is an XLM-RoBERTa model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, legal cases, and news articles. The data was collected from three main sources: ReliefWeb, UNHCR Refworld, and Europe Media Monitor News Brief. Although XLM-RoBERTa was trained on 100 languages, this fine-tuning was performed on only three (English, French, and Spanish) because sufficient humanitarian data of this kind could not be found in other languages.

Intended uses

To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics. Humanitarian texts often use very specific language, so this adaptation makes fine-tuning on downstream tasks (such as disaster response text classification) more effective. The model is primarily intended to be fine-tuned on tasks such as sequence classification or token classification.
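As a rough illustration of the sequence classification use case, the sketch below prepares a classification head configuration and loads the model with the Hugging Face `transformers` library. The Hub identifier in `MODEL_ID` is an assumption that should be verified before use. The helper names and label names are hypothetical, and the multi-label setting is one possible choice rather than a documented requirement.

```python
# Assumed Hub identifier for HumBert; verify it before relying on it.
MODEL_ID = "nlp-thedeep/humbert"


def classification_kwargs(labels, multi_label=True):
    """Build the config arguments for a classification head (hypothetical helper)."""
    id2label = dict(enumerate(labels))
    return {
        "num_labels": len(labels),
        "id2label": id2label,
        "label2id": {label: i for i, label in id2label.items()},
        "problem_type": (
            "multi_label_classification"
            if multi_label
            else "single_label_classification"
        ),
    }


def load_for_sequence_classification(labels, model_id=MODEL_ID):
    """Load tokenizer and model with a freshly initialised classification head."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, **classification_kwargs(labels)
    )
    return tokenizer, model


# Illustrative label set, not the actual humanitarian taxonomy.
kwargs = classification_kwargs(["Health", "Shelter", "Protection"])
```

Calling `load_for_sequence_classification(["Health", "Shelter", "Protection"])` would download the weights and return a tokenizer and model ready for fine-tuning. For token classification, `AutoModelForTokenClassification` could be swapped in, without the `problem_type` argument.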