Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.

Title: Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.
Publication Type: Journal Article
Year of Publication: 2014
Authors: Gobeill, J, Pasche, E, Vishnyakova, D, Ruch, P
Journal: Database (Oxford)
Volume: 2014
Date Published: 2014
ISSN: 1758-0463
Abstract

Gene function curation of the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and help from bioinformatics is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task of automatic GO concept assignment from full text. At that time, results were judged to be far from the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited automatic GO assignment. For this challenge, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% hierarchical recall in the top 20 output concepts. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. Contrary to previous competitions, machine learning this time outperformed dictionary-based systems. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
DATABASE URL: http://eagl.unige.ch/GOCat4FT/.

DOI: 10.1093/database/bau088
Alternate Journal: Database (Oxford)
PubMed ID: 25190367
PubMed Central ID: PMC4154439