METAPLANTCODE

Plants drive terrestrial ecosystems and, as autotrophs, are among the most important organisms in the terrestrial food web. They are directly affected by climate change, which will have a profound effect on ecosystems as well as on primary production and agriculture. Currently, an estimated two out of five plant species are threatened with extinction (Antonelli et al. 2020). Their loss will also affect other groups of organisms.

Metabarcoding of plants makes it possible to monitor plants’ functional dependencies and organismal networks such as food webs, pollination and dispersal. Examples include the analysis of feces and plant traces in guts to monitor herbivores’ diets, and the analysis of plant traces attached to organisms, e.g. for pollinator monitoring using Malaise or pan traps. Understanding and alleviating the driving forces behind the current unprecedented biodiversity change calls for accurate monitoring, harmonization of protocols, high data quality, interoperability among different databases and pipelines, and streamlining of efforts (IPBES 2019; Grooten et al. 2020; Kolter & Gemeinholzer 2021). Metabarcoding describes the analysis of complex environmental DNA (eDNA) samples with the aim of taxonomic identification. The method can be standardized and automated and is suitable for high-throughput, large-scale and long-term monitoring. Metabarcoding has the potential to provide a scale and accuracy in biodiversity surveys that were previously unattainable for many taxonomic groups (Deiner et al. 2017; Ruppert et al. 2019; Kolter & Gemeinholzer 2021).

Nevertheless, the increase in technical complexity compared to most other monitoring methods also implies a higher susceptibility to errors and therefore requires stringent quality control and harmonization across techniques, workflows and institutions (Deiner et al. 2017; Ruppert et al. 2019; Thalinger et al. 2020). Plant metabarcoding is not as straightforward as metabarcoding of many animal groups (for a review see Deiner et al. 2017), because a single barcode marker is not sufficient for correct species-level identification across the whole plant kingdom due to evolutionary constraints (e.g. hybridization of species across different taxonomic levels, polyploidy, rapidly and slowly radiating groups, apomixis). DNA barcode sequences sometimes resolve only to clades of multiple species or to a higher taxonomic level, which does not meet the requirements of most monitoring purposes. The use of multiple markers in eDNA samples, however, causes assignment problems. Reference databases for comparing unknown sequences with known sequences are constantly being improved in international genbanks, in the course of national barcoding initiatives (NorBOL, ARISE, GBOL, and others), in the Barcode of Life Data System (BOLD; Ratnasingham & Hebert 2007), and by initiatives like Biodiversity Genomics Europe (BGE), and this continuously increases the accuracy of identification. Nevertheless, even comprehensive reference databases decrease taxonomic precision if they are too non-specific.
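
The loss of resolution described above can be illustrated with a small sketch. It assumes a lowest-common-ancestor (LCA) rule, a common convention in metabarcoding pipelines that is used here only for illustration and not as the method of this project. The function name and the example lineages are hypothetical; when equally good reference hits disagree at species level, the assignment falls back to the deepest rank they share.

```python
from typing import Sequence

def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> list[str]:
    """Return the ranks shared by all lineages, from the highest rank downwards.

    Each lineage is ordered from higher rank to species (illustrative format).
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks through the lineages rank by rank.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return shared

# Illustrative input: two equally good hits for one sequence (assumed, not real results).
hits = [
    ["Rosaceae", "Rubus", "Rubus idaeus"],
    ["Rosaceae", "Rubus", "Rubus caesius"],
]
assignment = lowest_common_ancestor(hits)
print(assignment)  # the sequence is resolved to genus level only
```

In this example the marker cannot separate the two species, so the identification stops at the genus, which is the situation that motivates the use of further markers or additional information.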

The continuous optimization of reference databases highlights the importance of versioning and citability in appropriate repositories. Considering the great importance of plants for organismal interactions and as the primary producers of ecosystems, we here propose the use of not only DNA sequences but also additional information for correct species identification in an AI context, e.g. georeferenced species occurrence data (GBIF.org), site-specific vegetation information and species checklists, and information on polyploidy and hybridization (e.g. Biolflor.org). By adding further information to the plant metabarcoding results and evaluating results in an AI context, the accuracy of taxonomic identification will resemble BLAST results in genbanks such as ENA, NCBI and DDBJ.

Furthermore, the applicability of taxonomic names changes over space and time, and the same name might refer to different taxonomic concepts (e.g. in reference databases, species checklists, genbanks, or different versions of the plant red lists in different countries). Linking a taxonomic name to a specific concept defined in a taxonomic treatment in a reference publication (Biodiversity Literature Repository; ChecklistBank) will make it possible to compare identifications over space and time, to find additional data about the taxon through text and data mining, and to use synonyms and re-descriptions to widen the application of additional data.

The project builds upon existing structures, networks and standards, which need further enhancement and harmonization across national and international initiatives (e.g. GBIF, BOLD, TDWG, GBOL, NorBOL, SBDI, ARISE). Furthermore, specific databases (e.g. ASV banks) are emerging. Different bioinformatic pipelines are currently being built, and ELIXIR and the de.NBI-cloud services are available for online analysis.

Deep learning algorithms, and more specifically pre-trained transformer-based models, can make use of already available knowledge (e.g. DNA sequences, literature), parts of which still need to be made digitally available and taxonomically comparable in standardized ways. With selected user case studies, plant metabarcoding pipelines can be harmonized and optimized for different infrastructures and researchers across Europe. Exchange between European GBIF nodes and communication with other DNA barcoding monitoring activities (e.g. freshwater monitoring) can benefit a larger community in the future.
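
The idea of combining sequence similarity with additional information can be sketched in a few lines. This is a deliberately simplified, rule-based illustration and not the AI approach proposed in the project: it assumes that each candidate taxon comes with a sequence similarity score and that a site-specific checklist is available as a set of names. The function name, the penalty value and the placeholder taxa are all hypothetical.

```python
def rerank_with_checklist(
    candidates: dict[str, float],
    local_taxa: set[str],
    penalty: float = 0.5,
) -> list[tuple[str, float]]:
    """Down-weight candidate taxa that are not on the site checklist.

    candidates maps taxon name -> sequence similarity score (assumed in 0..1).
    penalty is an arbitrary illustrative factor, not a recommended value.
    """
    scored = {
        name: score * (1.0 if name in local_taxa else penalty)
        for name, score in candidates.items()
    }
    # Sort by adjusted score, best candidate first.
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Placeholder input: two candidates with near-identical sequence similarity.
candidates = {"Taxon A": 0.99, "Taxon B": 0.98}
site_checklist = {"Taxon B"}  # only Taxon B is known from the sampling site

ranking = rerank_with_checklist(candidates, site_checklist)
print(ranking[0][0])  # the locally occurring candidate is ranked first
```

A learned model would replace the fixed penalty with weights derived from occurrence records, vegetation data and trait information, but the principle is the same: the sequence match alone does not decide the identification.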