Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction.

Title: Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction.
Publication Type: Journal Article
Year of Publication: 2008
Authors: Gobeill, J, Tbahriti, I, Ehrler, F, Mottaz, A, Veuthey, A-L, Ruch, P
Journal: BMC bioinformatics
Volume: 9 Suppl 3
Date Published: 2008
Keywords: Algorithms; Artificial Intelligence; Genes; MEDLINE; Natural Language Processing; Pattern Recognition, Automated; Proteins; Sensitivity and Specificity; Terminology as Topic; Vocabulary, Controlled

BACKGROUND: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference Into Function), as defined in ENTREZ-Gene, from a MEDLINE record. The inputs for this task are a gene and a pointer to a MEDLINE reference. The suggested approach merges two independent sentence extraction strategies. The first strategy (LASt) uses argumentative features inspired by discourse-analysis models. The second (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all candidate GeneRiFs. The combination of the two approaches also aims to reduce the size of the selected segment by filtering out non-content-bearing rhetorical phrases.
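
To make the idea of density-based sentence ranking concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how GOEx computes density or how the two strategies are merged. The sketch assumes a toy density (the share of sentence tokens that also occur in a small list of Gene Ontology term labels), a precomputed per-sentence discourse score standing in for LASt, and a linear interpolation controlled by a hypothetical `weight` parameter. All function names are illustrative.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def go_density(sentence, go_terms):
    """Toy density: fraction of sentence tokens found in any GO term label.

    Assumption: a plain token lookup replaces the automatic text
    categorizer mentioned in the abstract.
    """
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    vocab = {tok for term in go_terms for tok in tokenize(term)}
    hits = sum(1 for tok in tokens if tok in vocab)
    return hits / len(tokens)


def rank_sentences(sentences, go_terms, discourse_scores, weight=0.5):
    """Rank sentences by a weighted mix of GO density and discourse score.

    Assumption: linear interpolation; the paper's actual combination
    scheme is not described in the abstract.
    """
    scored = []
    for sentence, discourse in zip(sentences, discourse_scores):
        score = weight * go_density(sentence, go_terms) + (1 - weight) * discourse
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

With equal discourse scores, a sentence that mentions more Gene Ontology vocabulary ranks first. Setting `weight` to 0 or 1 isolates one of the two signals.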

RESULTS: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). Combining the two strategies clearly improves extraction, achieving a Dice score of over 57% (+10%).
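
For readers unfamiliar with the Dice score, the sketch below computes a token-set Dice coefficient between a candidate sentence and a reference GeneRiF, 2|A ∩ B| / (|A| + |B|). The preprocessing used in the reported evaluation (for example stop-word handling or stemming) is not given in this abstract, so the simple lowercase tokenization here is an assumption and the function name is illustrative.

```python
import re


def dice(candidate, reference):
    """Token-set Dice coefficient between two strings.

    Assumption: tokens are lowercase alphanumeric runs, with no stemming
    or stop-word removal. Returns 0.0 when both strings have no tokens.
    """
    a = set(re.findall(r"[a-z0-9]+", candidate.lower()))
    b = set(re.findall(r"[a-z0-9]+", reference.lower()))
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

A score of 1.0 means the two token sets are identical, and 0.0 means they share no tokens.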

CONCLUSIONS: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.

Alternate Journal: BMC Bioinformatics
PubMed ID: 18426554