Most of this glossary is identical to the glossary of the general DNA publishing guide. Entries unique to this version are marked with "(§)".
- Atlas of Living Australia (ALA)
-
The ALA is a web-based platform that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable to anyone (see https://www.ala.org.au/about-ala/). The open infrastructure platform developed by the ALA is also used by several other countries for their own national biodiversity data platforms (see https://living-atlases.gbif.org/).
- Amplicon Sequence Variant (ASV)
-
Unique DNA sequence derived from high-throughput sequencing and denoising, and assumed to represent a biologically real sequence variant. See also Operational Taxonomic Unit (OTU) and Callahan et al. (2017).
- Application Programming Interface (API)
-
Set of protocols and tools for interaction and data transmission between different computer applications.
- Barcode Index Numbers (BINs)
-
Species-level Operational Taxonomic Units (OTUs) derived from clustering of the cytochrome c oxidase I (COI) gene in animals. Each BIN is assigned a globally unique identifier and is made available in a searchable database within the Barcode of Life Data System (BOLD).
- BIOM format
-
The BIOM format (canonically pronounced "biome"; see https://biom-format.org/) is designed to be a general-use format for representing biological sample by observation contingency tables. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomic Standards Consortium supported project.
- Barcode of Life Data System (BOLD)
-
BOLD is the reference database maintained by the Centre for Biodiversity Genomics in Guelph on behalf of the International Barcode of Life Consortium (IBOL). It hosts data on barcode reference specimens and sequences for eukaryote species, particularly COI for animals, and maintains the Barcode Index Number (BIN; Ratnasingham & Hebert 2013) system, identifiers for OTUs of approximately species rank, based on clusters of closely similar sequences.
- Biodiversity data platform
-
General online resource to discover and access biodiversity data derived from various sources, such as natural history collections, citizen science, ecology and monitoring projects, and genetic sequences. Can be global (GBIF) or national (ALA).
- Clustering
-
In taxonomic classification, the process of grouping organisms together according to some similarity criterion. See Operational Taxonomic Unit.
- Community (bulk) DNA
-
DNA from bulk samples (e.g. plankton samples or Malaise trap samples consisting of several individuals from many species). For the purpose of this guide, bulk sample DNA is included in the eDNA concept.
- Darwin Core Archive (DwC-A)
-
Compressed (ZIP) file format for exchange of biodiversity data compiled in accordance with the Darwin Core (DwC) standard. Essentially a self-contained set of interconnected CSV files and an XML document describing included files and data columns, and their mutual relationships.
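As a rough illustration of the structure described above, the following minimal sketch builds a tiny archive in memory and reads its core file back using the column mapping in `meta.xml`. The file name `occurrence.csv`, the record values and the helper names `build_example_archive` and `read_core` are assumptions made for this example; real archives typically also contain an EML metadata document and may include extension files.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by Darwin Core Archive descriptor files (meta.xml).
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Illustrative descriptor: one core file with an id column and two DwC terms.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
  </core>
</archive>
"""

# Made-up occurrence record, for illustration only.
OCCURRENCE_CSV = (
    "id,scientificName,decimalLatitude\n"
    "occ-1,Phascolarctos cinereus,-33.86\n"
)


def build_example_archive() -> bytes:
    """Create a tiny archive in memory (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META_XML)
        archive.writestr("occurrence.csv", OCCURRENCE_CSV)
    return buffer.getvalue()


def read_core(archive_bytes: bytes) -> list[dict[str, str]]:
    """Return core rows keyed by the short term names listed in meta.xml."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        core = ET.fromstring(archive.read("meta.xml")).find("dwc:core", NS)
        location = core.find("dwc:files/dwc:location", NS).text
        # Map column index -> short term name, e.g. 2 -> "decimalLatitude".
        columns = {
            int(field.get("index")): field.get("term").rsplit("/", 1)[-1]
            for field in core.findall("dwc:field", NS)
        }
        header_lines = int(core.get("ignoreHeaderLines", "0"))
        delimiter = core.get("fieldsTerminatedBy", ",")
        text = archive.read(location).decode("utf-8")
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        return [
            {name: row[index] for index, name in columns.items()}
            for row in rows[header_lines:]
        ]


records = read_core(build_example_archive())
print(records)
```

The sketch only handles a single-character delimiter written literally in `meta.xml`; archives that declare escaped delimiters such as `\t` would need that value translated first.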
- Darwin Core term
-
A standardized field name (e.g. decimalLatitude is the official DwC term for geographical latitude). (§)
- Darwin Core (DwC) standard
-
Standard for sharing and publishing biodiversity data, originating from the Biodiversity Information Standards (TDWG) community. In principle, a set of terms used for describing different entities of biodiversity observations, such as sampling events, occurrences and taxa. Current Darwin Core terms are described in the Quick Reference Guide.
- Data vocabulary
-
Set of preferred terms or concepts with specific, well-defined meanings and interrelationships, facilitating data exchange and reuse.
- ddPCR (droplet digital Polymerase Chain Reaction)
-
Droplet digital PCR. Method for measuring the absolute amount of DNA (number of copies) of one marker in a sample. See also qPCR.
- Denoising
-
In metabarcoding, method for separation of true biological sequences (see ASVs) from spurious sequence variants caused by PCR amplification and sequencing error.
- Digital Object Identifier (DOI)
-
Long-lasting reference used to uniquely identify (and locate) digital information objects, such as a biodiversity data set or a scientific publication.
- DNA barcoding and metabarcoding (amplicon sequencing)
-
Use of short, standardized DNA fragments to identify individual organisms via sequencing. Metabarcoding combines barcoding with high-throughput DNA sequencing, using universal primers to amplify and sequence large groups of organisms in eDNA samples.
- DNA marker
-
A DNA fragment used as a marker of some property (e.g., taxonomic affiliation). May, but does not have to, be a gene or a part of a gene.
- DNA metabarcoding database
-
Database containing DNA sequences (DNA barcodes) from previously recovered or studied organisms. The reference sequences were ideally generated from individuals of described, well-studied species (with the type specimen serving as the ideal) or of higher taxonomic levels (e.g., genus, family), but may also stem from eDNA sequencing efforts. It is wise not to trust "reference sequences" blindly.
- DNA-derived data
-
An extension to Occurrence core to capture information relating to DNA (e.g. primers, the sequence, sequencing platform, etc.). This extension is based on the MIxS standard used by the "GenBanks". (§)
- DNA probe
-
A short, synthetic single-stranded DNA fragment with fluorescent labelling that binds to a selected region of target DNA (marker) during PCR. Increases specificity and can be used in addition to primers in qPCR and ddPCR to detect and quantify a genetic marker.
- European Bioinformatics Institute (EMBL-EBI)
-
Intergovernmental organization for bioinformatics research and services, part of the European Molecular Biology Laboratory (EMBL), providing e.g. (raw) sequence reads and assembly data via the European Nucleotide Archive (ENA).
- Environmental DNA (eDNA)
-
DNA from an environmental sample, e.g. soil, water, air or host organism. An often used definition is that environmental DNA is the genetic material (DNA) obtained from environmental samples without any obvious evidence of biological source material (Thomsen and Willerslev 2015).
- European Nucleotide Archive (ENA)
-
European repository for nucleotide sequences, covering raw sequencing data, sequence assembly information and functional annotation. Includes the Sequence Read Archive (SRA), and is maintained by the European Bioinformatics Institute (EMBL-EBI), as part of the International Nucleotide Sequence Database Collaboration (INSDC).
- Endpoint
-
In the context of GBIF, an "endpoint" refers to a URL or web address where a DwC-A can be accessed through the internet, and indexed by GBIF. (§)
- FASTQ
-
Text-based standard for storing molecular sequences and associated quality measures deriving from high-throughput sequencing (HTS). For each sequence position, one ASCII character represents the base call (identified nucleotide) and another represents its quality score.
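The per-position encoding can be illustrated with a minimal sketch that parses four-line FASTQ records and converts each quality character to a numeric score. It assumes the common Phred+33 offset (not stated in this glossary) and records that are not wrapped across lines; the record shown and the names `FastqRecord` and `parse_fastq` are made up for this example.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: list[int]


def parse_fastq(text: str, offset: int = 33) -> list[FastqRecord]:
    """Parse four-line FASTQ records; `offset` is the assumed ASCII offset."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Each quality character encodes one score: ord(char) - offset.
        scores = [ord(char) - offset for char in quality]
        records.append(FastqRecord(header[1:], sequence, scores))
    return records


# Made-up record: four bases with qualities 'I', 'I', '5' and '!'.
EXAMPLE = """@read_1
ACGT
+
II5!
"""

for record in parse_fastq(EXAMPLE):
    print(record.identifier, record.sequence, record.qualities)
# prints: read_1 ACGT [40, 40, 20, 0]
```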
- Global Biodiversity Information Facility (GBIF)
-
International network and research infrastructure, mainly focused on mobilizing and providing open access to global biodiversity data.
- Global Genome Biodiversity Network (GGBN)
-
International network of institutions concerned with efficient sharing and usage of genomic biodiversity samples and associated metadata, e.g. promoting the Darwin Core-compatible GGBN Data Standard.
- Global Positioning System (GPS)
-
Satellite navigation system operated by the United States Space Force.
- High-throughput sequencing (HTS)
-
Different technologies for massively parallel sequencing, producing millions of DNA sequence reads from library preparations of genetic material, rather than targeting single amplicons as in traditional Sanger sequencing. Also called Next Generation Sequencing (NGS).
- Ingestion
-
Process of importing data from heterogeneous sources, such as local databases, text files or spreadsheets, to a common destination system, such as an online biodiversity data platform, for storage and further analysis. Typically includes steps of extraction, transformation (cleaning) and loading (ETL).
- Indexing
-
Organization of information in accordance with a specific schema or structure, making data easier to access and present.
- International Nucleotide Sequence Database Collaboration (INSDC)
-
Joint effort of the DNA Data Bank of Japan (DDBJ), EMBL and NCBI to provide global public access to nucleotide sequence data and associated information.
- Integrated Publishing Toolkit (IPT)
-
The Integrated Publishing Toolkit — commonly referred to as the IPT — is free open-source software developed by GBIF and used by organizations around the world to create and manage repositories for sharing biodiversity datasets.
- Metagenomics
-
PCR-free sequencing of random genomic fragments in a mixed sample.
- Minimum Information about any (x) Sequence (MIxS) standard
-
Family of standards (checklists) for sequence metadata, developed by the Genomic Standards Consortium (GSC).
- molecular Operational Taxonomic Unit (mOTU)
-
See Operational Taxonomic Unit (OTU).
- National Center for Biotechnology Information (NCBI)
-
Division of the United States National Library of Medicine (NLM) housing important bioinformatics resources, such as the GenBank database of DNA sequences, and the Sequence Read Archive (SRA) of high-throughput sequencing data.
- Next Generation Sequencing (NGS)
-
See High-throughput sequencing (HTS).
- Occurrence
-
An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.
- Occurrence core
-
The part of DwC that includes all the central information (fields) on biological occurrences in GBIF (e.g. spatiotemporal data, taxonomy, etc.), including for eDNA data. (§)
- Operational Taxonomic Unit (OTU)
-
A group of closely related organisms treated as a single unit in ecological and taxonomic studies. OTUs can be defined by observable traits or by genetic data. In environmental DNA (eDNA) studies, especially metabarcoding, OTUs are clusters of DNA sequences grouped by similarity (often 97%) to approximate species-level classification. This approach helps researchers estimate biodiversity, assess community composition, and explore ecological interactions, even without taxonomic identification of the OTUs. To represent each OTU, a representative sequence (typically the most common, abundant, longest or central sequence in the cluster) is chosen to serve as a reference point for downstream analysis and comparisons across samples. OTUs can be defined per dataset, in which case the representative sequences of similar OTUs may differ between datasets depending on how they were picked. OTUs can also be defined globally to facilitate unambiguous referencing between studies, such as Species Hypotheses in UNITE and Barcode Index Numbers in the Barcode of Life Data System (BOLD). Amplicon Sequence Variants (ASVs) may be considered analogous to zero radius OTUs (zOTUs). The term molecular Operational Taxonomic Unit (mOTU) is often used for molecular OTUs.
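To make the idea of similarity-based clustering with a representative sequence concrete, here is a deliberately simplified sketch of greedy clustering at a 97% threshold. It is not the algorithm of any particular tool: it assumes equal-length, already aligned sequences, measures identity as the fraction of matching positions, and picks the most abundant sequence as the representative. The sequences, read counts and function names are made up for this example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, aligned sequences."""
    if len(a) != len(b):
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def cluster_otus(
    abundances: dict[str, int], threshold: float = 0.97
) -> dict[str, list[str]]:
    """Greedy clustering: the most abundant sequences become representatives first."""
    otus: dict[str, list[str]] = {}
    # Process sequences from most to least abundant (ties broken alphabetically).
    ordered = sorted(abundances, key=lambda seq: (-abundances[seq], seq))
    for sequence in ordered:
        for representative in otus:
            if identity(sequence, representative) >= threshold:
                otus[representative].append(sequence)
                break
        else:
            # No representative is similar enough: start a new OTU.
            otus[sequence] = [sequence]
    return otus


# Made-up 40-base sequences: one and two mismatches relative to `base`.
base = "ACGT" * 10
one_off = "T" + base[1:]    # 39/40 = 97.5% identical to base
two_off = "TT" + base[2:]   # 38/40 = 95.0% identical to base

otus = cluster_otus({base: 100, one_off: 5, two_off: 3})
for representative, members in otus.items():
    print(representative, len(members))
```

With these inputs, `one_off` joins the OTU represented by `base`, while `two_off` falls below the threshold and starts its own OTU, even though it is 97.5% identical to `one_off`. This shows how the outcome depends on which sequences are chosen as representatives.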
- OTU table
-
Spreadsheet that holds the number of sequencing reads detected of each OTU/sequence in each sample.
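As a small illustration, an OTU table can be derived by counting reads per sample and OTU. The sample names, OTU labels and the helper `build_otu_table` below are hypothetical; the layout (OTUs as rows, samples as columns) is one common choice, not a requirement.

```python
from collections import Counter

# Hypothetical (sample, OTU) assignments, one tuple per sequencing read.
reads = [
    ("sample_A", "OTU_1"), ("sample_A", "OTU_1"), ("sample_A", "OTU_2"),
    ("sample_B", "OTU_2"), ("sample_B", "OTU_3"), ("sample_B", "OTU_3"),
]


def build_otu_table(assignments):
    """Return (sample names, {OTU: read counts in sample order})."""
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in assignments})
    otu_ids = sorted({otu for _, otu in assignments})
    table = {
        otu: [counts[(sample, otu)] for sample in samples]
        for otu in otu_ids
    }
    return samples, table


samples, table = build_otu_table(reads)

# Print as tab-separated text: OTUs as rows, samples as columns.
print("\t".join(["OTU"] + samples))
for otu, row in table.items():
    print("\t".join([otu] + [str(count) for count in row]))
```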
- Polymerase Chain Reaction (PCR)
-
Technique for fast amplification and detection of specific fragments of target DNA (or RNA) sequences. Amplified regions are determined by the pair of PCR primers used in the reaction.
- Pipeline
-
In bioinformatics, a set of algorithms or tools applied in a predefined workflow to process e.g. High-throughput sequencing (HTS) data.
- Primers (PCR primers)
-
Short, synthetic, single-stranded DNA fragments that bind to a selected region of target DNA (marker) to initiate replication during PCR. A pair of primers is necessary for the polymerase enzyme to amplify the selected marker.
- qPCR (quantitative Polymerase Chain Reaction)
-
Quantitative PCR. Method that measures relative DNA quantity of a marker in a sample. See also ddPCR.
- Sample
-
Material (water, soil, gut content, etc.) obtained for analysis.
- Sequence alignment
-
Bioinformatic process of comparing and arranging two or more molecular (DNA, RNA or protein) sequences to detect similarities caused by e.g. evolutionary relatedness.
- Species Hypothesis (SH)
-
Species-level Operational Taxonomic Unit (OTU) as defined in the UNITE database and sequence management environment, for Fungi.
- Specimen
-
An individual animal, plant, fungus, etc. used as an example of its species or type for scientific study or display.
- Sequence Read Archive (SRA)
-
Public repository of high throughput (NGS) sequencing data, with instances operated by the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). Includes both raw (non-denoised) sequencing output and sequence alignments. One of three components of the European Nucleotide Archive (ENA), and previously known as the Short Read Archive.
- Target-capture sequencing
-
Sequencing of DNA fragments isolated with hybridization probes.
- UNITE
-
UNITE is a web-based sequence management environment centred on the eukaryotic nuclear ribosomal ITS region. All public sequences are clustered into species hypotheses (SHs), which are assigned unique DOIs. An SH-matching service outputs various elements of information, including what species are present in eDNA samples, whether these species are potentially undescribed new species, other studies in which they were recovered, whether the species are alien to a region, and whether they are threatened. The DOIs are connected to the taxonomic backbone of the PlutoF platform and GBIF, such that they are accompanied by a taxon name where available. The data used in UNITE are hosted and managed in PlutoF. Data are represented through a range of standards, primarily Darwin Core, MIxS, and DMP Common Standard; partial support is available for EML, MCL, and GGBN. PlutoF exports data primarily through the CSV and FASTA formats. PlutoF can also be used to publish data in GBIF (using the DwC format) and to prepare GenBank submission files. It is furthermore possible to download species lists from your data and to download your project as a JSON document in which the project data are hierarchically structured.
- Zero radius OTU (zOTU)
-
See ASV.