KYBELE - Knowledge Yield from BiodivErsity Literature through Large Language Model Extraction
A new research project, KYBELE, has officially been launched to transform how biodiversity knowledge is accessed, explored, and reused. By combining artificial intelligence, FAIR data principles, and Europe’s leading research infrastructures, KYBELE aims to unlock valuable information hidden in decades of biodiversity literature.
Turning biodiversity literature into actionable knowledge
Biodiversity information is scattered across scientific publications and databases, which makes it difficult to retrieve and connect data on species, habitats, and ecological traits. KYBELE addresses this challenge by developing an AI-powered system that links genomic resources with ecological and taxonomic information, enabling more efficient exploration of existing knowledge.
As an initial demonstrator, the project focuses on selected groups of soil invertebrates, such as springtails (Collembola) and earthworms (Oligochaeta), which are well represented in both genomic and ecological data sources.
FAIR data at the core of the project
A key component of KYBELE is the creation of FAIR-compliant biodiversity resources. The project will compile structured catalogues of species names, habitats, and ecological traits using established data sources, including BOLD, Fauna Europaea, and BiodiversityPMC. Domain-specific vocabularies will be used to annotate biodiversity literature, improving consistency and interoperability with existing taxonomic and environmental ontologies.
These curated datasets will support accurate data reuse and form the basis for AI-driven analysis.
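To make the annotation step more concrete, the following is a minimal sketch of dictionary-based annotation, in which terms from a controlled vocabulary are matched in a sentence and tagged with a category and an identifier. It is an illustration only, not the project's pipeline: the vocabulary entries, the `EX:` identifiers, and the names `Annotation` and `annotate` are all hypothetical, and a production system would draw its terms from the catalogues and ontologies named above.

```python
import re
from dataclasses import dataclass

# Hypothetical vocabulary: surface form -> (category, identifier).
# The "EX:" identifiers are placeholders, not real ontology or database IDs.
VOCABULARY = {
    "folsomia candida": ("species", "EX:SP-0001"),
    "lumbricus terrestris": ("species", "EX:SP-0002"),
    "forest soil": ("habitat", "EX:HAB-0001"),
    "leaf litter": ("habitat", "EX:HAB-0002"),
    "detritivore": ("trait", "EX:TR-0001"),
}


@dataclass(frozen=True)
class Annotation:
    start: int      # character offset where the match begins
    end: int        # character offset just past the match
    text: str       # matched text as it appears in the sentence
    category: str   # species, habitat, or trait
    term_id: str    # placeholder vocabulary identifier


def annotate(text, vocabulary=VOCABULARY):
    """Return non-overlapping vocabulary matches, ordered by position."""
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        flags=re.IGNORECASE,
    )
    annotations = []
    for match in pattern.finditer(text):
        category, term_id = vocabulary[match.group(1).lower()]
        annotations.append(
            Annotation(match.start(), match.end(), match.group(1), category, term_id)
        )
    return annotations


sentence = "Folsomia candida is a detritivore commonly found in leaf litter."
for ann in annotate(sentence):
    print(ann.start, ann.end, ann.text, ann.category, ann.term_id)
```

Keeping character offsets alongside each identifier means every annotation can be traced back to the exact span of text it came from, which is one simple way to keep annotated literature reusable.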
AI-driven exploration of biodiversity literature
KYBELE will develop and fine-tune large language models to support retrieval-augmented question answering over biodiversity literature. Using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique, together with knowledge distillation, the models will be adapted to biodiversity-specific content while maintaining traceability to original sources.
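As a rough illustration of why LoRA counts as parameter-efficient, the sketch below applies the standard LoRA update to a single weight matrix: the pretrained weights stay frozen and only two small low-rank matrices are trained. The announcement does not give any model or configuration details, so the matrix sizes, the rank, the scaling factor `alpha`, and the function names are assumptions chosen for readability. Real fine-tuning would rely on a deep learning library rather than plain Python lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, lora_b, lora_a, alpha, rank):
    """Return W0 + (alpha / rank) * B @ A without modifying W0."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


# Illustrative sizes only (assumed, not taken from the project).
d_out, d_in, rank, alpha = 6, 8, 2, 4

rng = random.Random(0)
# Frozen pretrained weight, shape d_out x d_in.
w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factor A, shape rank x d_in, small random initialisation.
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
# Trainable factor B, shape d_out x rank, initialised to zero.
lora_b = [[0.0] * rank for _ in range(d_out)]

# Because B starts at zero, the adapted layer initially equals the pretrained one.
assert lora_effective_weight(w0, lora_b, lora_a, alpha, rank) == w0

full_params = d_out * d_in           # parameters updated by full fine-tuning
lora_params = rank * (d_out + d_in)  # parameters updated by LoRA
print(full_params, lora_params)
```

The saving grows with layer size: the trainable parameter count scales with `rank * (d_out + d_in)` instead of `d_out * d_in`, so for large layers and a small rank the adapter is a small fraction of the full matrix.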
The goal is to provide an interactive literature exploration tool that allows users to query biodiversity publications in natural language, while preserving data provenance and transparency.
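The idea of answering questions while preserving provenance can be sketched as follows: passages are retrieved for a question, and each one carries its source identifier into the prompt so that an answer can cite where it came from. This is a simplified stand-in, not the project's method. Ranking here uses plain word overlap instead of a learned retriever, the passages and `doc-` identifiers are invented for the example, and the names `Passage`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier of the source publication
    text: str


def tokenize(text):
    """Count lowercase alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, top_k=2):
    """Rank passages by the number of word tokens shared with the question."""
    query = tokenize(question)
    scored = []
    for passage in passages:
        overlap = sum((query & tokenize(passage.text)).values())
        if overlap > 0:
            scored.append((overlap, passage))
    # Stable sort: passages with equal scores keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def build_prompt(question, retrieved):
    """Assemble a prompt in which every passage is tagged with its source."""
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in retrieved)
    return (
        "Answer using only the sources below and cite their identifiers.\n"
        f"{context}\n"
        f"Question: {question}"
    )


# Invented example passages; not taken from real publications.
CORPUS = [
    Passage("doc-001", "Springtails are abundant in forest leaf litter."),
    Passage("doc-002", "Earthworms improve soil structure through burrowing."),
    Passage("doc-003", "Sequencing costs have fallen over the past decade."),
]

question = "Where are springtails abundant?"
hits = retrieve(question, CORPUS)
print(build_prompt(question, hits))
```

Passing source identifiers through to the prompt is what lets a generated answer point back to specific publications; the language model call itself is omitted here because the announcement does not specify one.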
Deployment within European research infrastructures
The KYBELE system will be deployed as a containerised service, integrated within the ELIXIR and LifeWatch ERIC infrastructures. This ensures interoperability with existing services and supports long-term availability. The source code, models, and deployment recipes will be made openly available through established repositories and registries.
Reproducibility and technical sustainability