BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction

Title: BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Mottin, L, Gobeill, J, Mottaz, A, Pasche, E, Gaudinat, A, Ruch, P
Conference Name: CLEF 2016
Date Published: 08/2016 - CEUR Workshop Proceedings
Conference Location: Evora, Portugal
Keywords: Automatic Text Categorization, Concept Normalization, Discontinuous Entity Extraction, ICD-10, Named-Entity Recognition, Relocation, Statistical Training, UMLS

BiTeM/SIB Text Mining is a university research group carrying out activities in semantic and text analytics applied to health and life sciences. This paper reports on the participation of our team in the CLEF eHealth 2016 evaluation lab. The processing applied to each evaluation corpus (QUAERO and CépiDC) was originally very similar. Our method is based on an Automatic Text Categorization (ATC) system. First, the system is configured with a specific input ontology (French UMLS), and ATC assigns a ranked list of related concepts to each document received as input. Then, a second module relocates all of the positive matches in the text and normalizes the extracted entities. For the CépiDC corpus, the system was loaded with the Swiss ICD-10 GM thesaurus. However, a last-minute data transformation issue forced us to implement an ad hoc solution based on simple pattern matching to comply with the constraints of the CépiDC challenge. We obtained an average precision of 62% on the QUAERO entity extraction (averaged over the MEDLINE and EMEA texts and over exact and inexact matching), 48% on normalizing these entities, and 59% on the CépiDC subtask. Improving recall by expanding the coverage of the terminologies could be an interesting way to improve this system at moderate labour cost.
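
The two-stage idea described in the abstract (rank candidate concepts for a document, then relocate the matched terms in the text and attach the concept identifier) can be illustrated with a minimal Python sketch. This is not the BiTeM system: the token-overlap score stands in for the actual ATC ranking, the relocation step is plain exact string matching, and the terminology, its identifiers (`C-DIAB`, `C-HTA`, `C-ASTH`) and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy terminology: made-up concept identifier -> French term.
# A real run would load a terminology such as the French UMLS instead.
TERMINOLOGY = {
    "C-HTA": "hypertension artérielle",
    "C-DIAB": "diabète",
    "C-ASTH": "asthme",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_concepts(document, terminology):
    """Stage 1 (stand-in for ATC): rank concepts by the fraction of
    their term tokens that occur in the document."""
    doc_tokens = set(tokenize(document))
    scored = []
    for concept_id, term in terminology.items():
        term_tokens = set(tokenize(term))
        score = len(term_tokens & doc_tokens) / len(term_tokens)
        if score > 0:
            scored.append((concept_id, score))
    # Highest score first; ties broken by identifier for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def relocate(document, ranked, terminology):
    """Stage 2: find each ranked concept's term in the text and return
    (start, end, surface form, concept identifier) tuples."""
    entities = []
    for concept_id, _score in ranked:
        pattern = re.compile(re.escape(terminology[concept_id]), re.IGNORECASE)
        for match in pattern.finditer(document):
            entities.append((match.start(), match.end(), match.group(0), concept_id))
    return sorted(entities)


text = "Patient avec diabète et hypertension artérielle."
ranked = rank_concepts(text, TERMINOLOGY)
entities = relocate(text, ranked, TERMINOLOGY)
for start, end, surface, concept_id in entities:
    print(start, end, surface, concept_id)
```

Because the relocation step only finds contiguous exact matches, a sketch like this cannot recover discontinuous entities or spelling variants, which is one way to see why terminology coverage bears directly on recall.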