Detailed user guide

This section provides a section-by-section guide to using the MDT, including a step-by-step processing guide.

Users with prepared datasets ready to process: skip this section and go to: Select an MDT installation.

Users without prepared datasets: use e.g. the provided example dataset (see below).

Example dataset

This example dataset is a real metabarcoding dataset of surface seawater metazoan communities, that has been modified to better illustrate some features of the MDT.

  1. Download example_dataset_2.

  2. Explore the structure of the example data.

    • The OTU_table. Column headers (CO1_A1.1006.a.S1.L001, etc.) are IDs of the 159(160) samples. Row names (merged_CO1_1, etc.) are IDs of the 24,744 OTUs.

    • The Taxonomy table contains OTU IDs corresponding to those (row names) in the OTU_table, the sequence, and taxonomic information.

    • The Samples table contains IDs corresponding to those (column names) in the OTU_table, and sample metadata: spatiotemporal data, date, etc. NB: Some fields are already using DwC terms; others are not yet standardized and will need manual mapping.

    • The Study table contains information/data that is the same across all samples (and/or OTUs) in the whole dataset. All fields (the term column) are already using exact Darwin Core terms as needed for this table.
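Before uploading your own data, it can help to confirm that the IDs in the tables correspond as described above. The following is a minimal sketch, not part of the MDT: the function name `compare_ids` and the toy ID lists are hypothetical, and only the ID naming pattern is taken from the example dataset.

```python
def compare_ids(otu_row_ids, taxonomy_ids):
    """Return the IDs that occur in only one of the two tables."""
    otu_rows = set(otu_row_ids)
    taxa = set(taxonomy_ids)
    return {
        "missing_from_taxonomy": sorted(otu_rows - taxa),
        "missing_from_otu_table": sorted(taxa - otu_rows),
    }


# Toy IDs (hypothetical) following the naming pattern of the example dataset.
otu_row_ids = ["merged_CO1_1", "merged_CO1_2", "merged_CO1_3"]
taxonomy_ids = ["merged_CO1_1", "merged_CO1_2", "merged_CO1_4"]

report = compare_ids(otu_row_ids, taxonomy_ids)
print(report)
```

The same comparison can be applied to the column names of the OTU_table and the IDs in the Samples table.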

Note
The 160th column of the OTU table corresponds to a negative control sample (NEG), which should be excluded from the final dataset. The MDT will automatically remove any samples that are not present in both the Samples and OTU_table tables. Since the negative control has already been removed from the Samples table, it will not be included in the final dataset.
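The effect described in the note can be pictured as keeping only the sample IDs shared by the two tables. This is an illustrative sketch under our own assumptions, not the MDT's implementation: `keep_shared_samples` and the nested-dictionary layout of the toy OTU table are hypothetical.

```python
def keep_shared_samples(otu_table, sample_ids):
    """Keep only samples present in both the OTU table and the Samples table.

    otu_table maps OTU ID -> {sample ID: read count}.
    Returns the filtered table and the sorted list of dropped sample IDs.
    """
    otu_columns = {s for counts in otu_table.values() for s in counts}
    shared = otu_columns & set(sample_ids)
    dropped = sorted((otu_columns | set(sample_ids)) - shared)
    filtered = {
        otu_id: {s: n for s, n in counts.items() if s in shared}
        for otu_id, counts in otu_table.items()
    }
    return filtered, dropped


# Toy data (hypothetical read counts); NEG is absent from the Samples table.
otu_table = {
    "merged_CO1_1": {"CO1_A1.1006.a.S1.L001": 12, "NEG": 3},
    "merged_CO1_2": {"CO1_A1.1006.a.S1.L001": 0, "NEG": 1},
}
sample_ids = ["CO1_A1.1006.a.S1.L001"]

filtered, dropped = keep_shared_samples(otu_table, sample_ids)
print(dropped)
```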

Select an MDT installation

  • Still learning or testing? Use the MDT Sandbox (Demo Installation).

  • Got a real dataset? Identify and use the most relevant hosted MDT installation for the dataset. how can we guide here??

  • Can’t find a relevant MDT installation? Use the GBIF Conversion MDT.

Landing page

  1. Log in with GBIF username

The landing page (home) is where users access their datasets, start new uploads, etc.

dg landing

The Menu Bar contains the following items

  • Administration (only visible to MDT managers)

  • My datasets

  • New dataset: start new data upload

  • Login / User area (user name): links to: logout, My datasets and Administration (for Managers).

My datasets

In this area users can access the datasets they have uploaded to the MDT.

  • Datasets can be opened (eye icon). The Browse tab visualizes the data much like the review step, Metadata allows inspection of basic dataset information/metadata, and Files available gives access to download a standardized BIOM file and a [dwc-a] of the dataset (if these have been produced).

  • Datasets can be edited. By clicking the pen icon, the dataset is opened at processing step 1 (Upload data). This allows users to initiate editing of the dataset at any of the processing steps.

  • Datasets can be deleted (trash icon).

    Important
    Do not delete or start editing a dataset that is already published on GBIF.org unless you are certain that this is needed.
dg mydata
Figure 1. My datasets gives users access to view, edit/update and download previously processed datasets.

Help, feedback and other resources

  • The Need help? (fern leaf): Opens an email to contact the Manager of the MDT installation.

  • The RSS feed: Lists the most recently updated resources (datasets) available as an RSS feed. This can be used to monitor incoming datasets to the MDT installation.

  • The Report a bug and Request a feature links: Open a GitHub issue where users can report bugs or request features to be developed.

Upload data (step 1)

  1. Drag and drop the dataset to the upload area, or click and select the file.

  2. Give the dataset a nickname (e.g. "my_advanced_test").

  3. Press Start Upload.

    A green icon will indicate if a valid file format is detected.

Example dataset

  • A green XLSX icon will indicate that this valid file format is detected.

  • A warning will inform that one of the columns in the OTU_table does not have a corresponding row in the Sample table. We knew that, and it is OK, as we wanted that sample to be removed in the final data (see above).

advanced upload
  1. Open the data viewer by clicking on the eye icon next to the uploaded dataset.

  2. View and verify the structure and content of the tables - e.g. corresponding to the four sheets in the uploaded Excel Workbook for the example data.

  3. Close the viewer by pressing Back.

  4. Press Proceed.

Map terms (step 2)

At this step it is specified how each field corresponds to Darwin Core terms - i.e. mapping. It is possible to adjust the automatic mapping, to include extra fields with global values, and to add non-standardised data as so-called extended measurements or facts.

Tip
Press how to use this form to get a guided tour of this page.
Tip
Press Save mapping once in a while to make sure that you do not get logged out.

Automatic mapping

  1. Inspect the overall structure and information on the page.

    1. The upper section named Samples maps our sample data fields to Darwin Core terms (first column), automatically identifying and mapping fields from the Samples table (second column) and global fields from the Study table (third column) with their identically named Darwin Core counterparts.

    2. The second section named Taxon does the same for taxonomic and sequence related information, auto-mapping fields from the Taxonomy table to identically named Darwin Core fields.

    3. The last section Unmapped fields lists all the fields in the uploaded data with names the MDT does not automatically recognize. Below there is an option to put unmapped fields into so-called Extended Measurement Or Facts.

  2. Press Save Mapping and notice how you get a warning if some essential fields have not been mapped.
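The idea of automatic mapping by identical names can be sketched as follows. This is a simplified illustration, not the MDT's actual code: `DWC_TERMS` is a small hand-picked subset of Darwin Core terms, and the field list is a hypothetical Samples table header.

```python
# Small illustrative subset of Darwin Core terms, not the full vocabulary.
DWC_TERMS = {
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
    "materialSampleID",
    "scientificNameID",
    "taxonRank",
}


def auto_map(field_names, known_terms=DWC_TERMS):
    """Map fields whose names exactly match a known term; list the rest."""
    mapped = {name: name for name in field_names if name in known_terms}
    unmapped = [name for name in field_names if name not in known_terms]
    return mapped, unmapped


# Hypothetical column headers from an uploaded Samples table.
fields = ["eventDate", "Latitude", "Longitude", "salinity", "temperature"]
mapped, unmapped = auto_map(fields)
print(mapped)
print(unmapped)
```

Because the match is exact, a field named Latitude stays unmapped even though it clearly belongs to decimalLatitude, which is why correctly spelled Darwin Core names in the upload files save manual work.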

Example dataset

advanced mapping samp
Figure 2. Sample section: the MDT automatically identified and mapped 6 fields from the uploaded Samples table (second column) and approx. 20 global term/values from the Study table (third column) to identically named Darwin Core terms (first column).
advanced mapping tax
Figure 3. Taxon section: the MDT automatically identified and mapped 10 fields from the uploaded Taxonomy table.

Suggested mapping

  1. Inspect potential suggested mappings and click to accept and specify them.

Example data

advanced lat map
  1. term:dwc[decimalLatitude] and term:dwc[decimalLongitude] were not mapped automatically, but the MDT suggests using Latitude and Longitude.

  2. Click on Latitude to specify that mapping, and the same for Longitude.

advanced seq map
  1. term:mixs[DNA_sequence] was not mapped automatically, but the MDT suggests using sequence.

  2. Click on sequence to specify that mapping.
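How the MDT arrives at its suggestions is not described here. One plausible heuristic that reproduces the suggestions in the example is a case-insensitive substring match between unmapped field names and term names. The sketch below is purely an assumption for illustration, and `suggest_fields` is a hypothetical name.

```python
def suggest_fields(term, unmapped_fields):
    """Suggest unmapped fields whose name occurs inside the term name.

    The comparison ignores case, so "Latitude" matches "decimalLatitude".
    """
    term_lower = term.lower()
    return [field for field in unmapped_fields if field.lower() in term_lower]


# Field names taken from the example dataset.
unmapped = ["Latitude", "Longitude", "sequence", "salinity"]

for term in ("decimalLatitude", "decimalLongitude", "DNA_sequence"):
    print(term, "->", suggest_fields(term, unmapped))
```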

Manual mapping

Under Unmapped fields (above the Taxon section) we may see fields from the uploaded data, that were not automatically identified and mapped to any Darwin Core terms.

We expect that Darwin Core may accommodate several of these unmapped fields, and we may also want to modify and extend the uploaded data a bit.

  1. Try to accommodate as many of the unmapped fields as possible by mapping them to relevant Darwin Core terms using Add mapping for another sample field. Use the search function as explained above, consult the section [req_recom], or explore available fields/terms in Darwin Core Occurrence and the DNA derived data extension.

  2. Also use Add mapping for another sample field to add DwC terms, where you can provide global values that were not provided in the Study table of the uploaded data. Provide the global value of an added term in the third column (Default values).

Example dataset
Sample_Name, temperature, salinity, Accession_biosamples, lsid, rank are listed under Unmapped fields, as these fields in the uploaded data were not automatically identified and mapped to Darwin Core terms.

  1. One of the unmapped fields is called Accession_biosamples and contains links to Biosample records in INSDC (SRA/ENA). We want to map that field to the recommended Darwin Core term term:dwc[materialSampleID] for that.

  2. Go to the last part of the Sample section.

  3. Click on Add mapping for another sample field and look at the list of available terms.

    1. Start typing "material" to find and select term term:dwc[materialSampleID].

    2. Click Add field, and notice how the field is added to the list of terms.

    3. Now, select the field Accession_biosamples from the drop-down list to map it.

  4. We can also see that we forgot to provide the term:mixs[env_medium] in the format recommended using the ENVO ontology, but simply wrote "sea water". To fix that:

    1. Click on the pencil to the right of "sea water". A dialogue box opens.

    2. Remove "sea water" by clicking the "sea water x".

    3. Search for "coastal sea"

    4. Select "coastal sea water" with OBO ID "ENVO:00002150". (NB: you can also click the link and explore the ENVO ontology online).

    5. Scroll down and press "OK" to close the dialogue box.

      Note
      term:mixs[env_broad_scale] and term:mixs[env_local_scale] were also described with the ENVO ontology, but values were correctly supplied in the uploaded data. Multiple values are possible: shoreline [ENVO:00000486] and intertidal zone [ENVO:00000316] for term:mixs[env_local_scale].
  5. As this data was also intended for publishing to OBIS, so-called lsid were provided for the taxonomic names according to the WoRMS checklist. Following the OBIS recommendations we will map that field to the term:dwc[scientificNameID].

    1. Go to the Taxon section.

    2. Click the "Add mapping for another Taxon/ASV field".

    3. Search, find and select term:dwc[scientificNameID].

    4. Map it to lsid.

  6. Similarly add the term term:dwc[taxonRank] and map it to rank.
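Step 4 of the example above replaces free text with an ENVO term written as a label followed by its OBO ID in square brackets. If you prefer to prepare such values before upload, a small helper like the one below can build them. This is a sketch only: the function names, the ID pattern check and the "|" separator for multiple values are our assumptions, not a format prescribed in this guide.

```python
import re

# Assumed pattern: the prefix "ENVO:" followed by digits.
ENVO_ID = re.compile(r"^ENVO:\d+$")


def format_envo(label, obo_id):
    """Return 'label [ENVO:xxxxxxxx]' after a basic check of the ID."""
    if not ENVO_ID.match(obo_id):
        raise ValueError(f"not an ENVO ID: {obo_id!r}")
    return f"{label} [{obo_id}]"


def format_envo_list(terms, separator="|"):
    """Join several (label, ID) pairs; the separator is an assumption."""
    return separator.join(format_envo(label, obo_id) for label, obo_id in terms)


env_medium = format_envo("coastal sea water", "ENVO:00002150")
env_local_scale = format_envo_list(
    [("shoreline", "ENVO:00000486"), ("intertidal zone", "ENVO:00000316")]
)
print(env_medium)
print(env_local_scale)
```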

Using Extended Measurements Or Facts

For some types of metadata there are no suitable fields in the data standards, but users may still want to include those in the final data. Such data can be accommodated in the Extended Measurement Or Facts (eMoF).

In the section Unmapped fields, the fields that remain unmapped are listed. These can be added as extended measurements/facts.

  1. Click on any field name from the row of unmapped fields (in the Extended Measurement Or Facts section) and see how it is transferred to the eMoF section below as a new entry.

  2. Add as many of the other descriptive fields as possible.

  3. Repeat for all fields you wish to include as eMoF.

Example dataset

As there is no suitable field for salinity in the dataset, this can be added as an extended measurement.

  1. Click on salinity from the row of unmapped fields (in the Extended Measurement Or Facts section) and see how it is transferred to the section below as a new entry.

  2. We know that the measurement unit is ppt, so we add that manually.

NB: leave the fields temperature and Sample_Name unmapped.
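Conceptually, adding a field as eMoF turns one column of the Samples table into one measurement record per sample. The sketch below illustrates that reshaping with the Darwin Core terms measurementType, measurementValue and measurementUnit. The function `to_emof_rows`, the `id` key and the salinity values are hypothetical, and the records produced by the MDT may contain additional fields.

```python
def to_emof_rows(samples, field, unit=""):
    """Turn one sample field into a list of measurement records."""
    rows = []
    for sample in samples:
        value = sample.get(field)
        if value in (None, ""):
            continue  # skip samples without a measurement
        rows.append(
            {
                "id": sample["id"],
                "measurementType": field,
                "measurementValue": str(value),
                "measurementUnit": unit,
            }
        )
    return rows


# Toy samples with made-up salinity values.
samples = [
    {"id": "sample_1", "salinity": 33.5},
    {"id": "sample_2", "salinity": 33.1},
    {"id": "sample_3", "salinity": ""},
]

emof = to_emof_rows(samples, "salinity", unit="ppt")
for row in emof:
    print(row)
```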

Tip
All available terms/fields from Darwin Core Occurrence and the DNA derived data extension can be included in the upload files, and – if spelled correctly – no manual mapping is needed.
Note
Unmapped fields will be retained in the MDT, also in the standardized BIOM file, and will only be absent from the produced [dwc-a]. They can be mapped/added later by revising the mapping if needed.

Now, the mapping is complete.

  1. Press Proceed.

Process data (step 3)

The option Assign taxonomy uses the GBIF Sequence ID tool to assign taxonomy to the OTUs by comparing the sequences with a reference database. This will replace/overwrite any taxonomy provided in the uploaded data. NB: Often it will be better to do the taxonomic assignment separately – using the GBIF tool or another preferred tool – before uploading, as it is a time consuming step!

Example dataset
We recommend not to use assign taxonomy for the example dataset. It will take a long time. If using it, it will be evident that the current CO1 reference database (BOLD BINs) cannot assign taxonomy to a number of the sequences in this dataset. The rest of the guide assumes that you used the taxonomy in the uploaded data for the example dataset.

  1. Press Process data.

    The MDT goes through a series of steps, each of which is marked with a green tick-mark if successful, and produces standardized BIOM files, which the MDT uses as an intermediate file format.

Example dataset
A warning will indicate that "NEG in the OTU table are not present in the SAMPLE table". We already knew that (see above).

advanced processing
  1. Inspect the Dataset stats and verify that number of samples and taxa are as expected.

  2. Press Proceed
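To know what numbers to expect in the Dataset stats, you can compute simple counts from your own OTU table beforehand. The following is a rough sketch with hypothetical names and toy counts. The MDT may define its statistics differently, for example after removing samples that are missing from the Samples table.

```python
def dataset_stats(otu_table):
    """Count samples, OTUs with at least one read, and total reads.

    otu_table maps OTU ID -> {sample ID: read count}.
    """
    sample_ids = set()
    otus_with_reads = 0
    total_reads = 0
    for counts in otu_table.values():
        sample_ids.update(counts)
        row_total = sum(counts.values())
        total_reads += row_total
        if row_total > 0:
            otus_with_reads += 1
    return {
        "samples": len(sample_ids),
        "otus": otus_with_reads,
        "reads": total_reads,
    }


# Toy table with made-up counts.
otu_table = {
    "merged_CO1_1": {"sample_1": 10, "sample_2": 5},
    "merged_CO1_2": {"sample_1": 0, "sample_2": 0},
    "merged_CO1_3": {"sample_1": 2, "sample_2": 0},
}

stats = dataset_stats(otu_table)
print(stats)
```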

Review (step 4)

At this step the processed data can be explored and reviewed to verify that everything is OK, e.g. ensure that control samples are not included as samples, and that the mapping is as expected.

advanced review
  1. Review the data.

    • Inspect the map and verify that the samples are placed geographically where expected.

    • Inspect the taxonomic bar-chart to ensure that taxonomic composition is as expected.

      • Try some of the other options for the bar-chart (e.g. Absolute read abundance).

    • Inspect PCoA/MDS ordination plots (visualization of compositional differences between samples) for outliers, e.g. to see if there are any control samples that should have been excluded. Try to color the ordination plot by some numerical variable.

    • Select single samples from the map, from charts or from the dropdown, and explore their metadata and taxonomy in the panel to the right.

    • Explore the "Most frequent OTUs" and "Least frequent OTUs".
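The difference between the absolute and relative options of the bar-chart comes down to whether read counts are divided by the total for each sample. A minimal sketch with made-up counts follows. `relative_abundance` is a hypothetical helper, not an MDT function.

```python
def relative_abundance(sample_counts):
    """Convert absolute read counts for one sample into proportions."""
    total = sum(sample_counts.values())
    if total == 0:
        # A sample without reads has no meaningful proportions.
        return {otu: 0.0 for otu in sample_counts}
    return {otu: count / total for otu, count in sample_counts.items()}


# Toy read counts for a single sample.
sample = {"merged_CO1_1": 75, "merged_CO1_2": 25, "merged_CO1_3": 0}

rel = relative_abundance(sample)
print(rel)
```

Relative values make samples with very different sequencing depth comparable, while absolute values show how many reads each sample actually has.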

Example dataset

  • Map: Pillar Point, Half Moon Bay, California, USA.

  • Ordination plots: Try coloring by salinity or temperature. temperature was not mapped to any DwC term, but such unmapped fields are included in the BIOM files, which makes these visualizations possible.

  • "Most frequent OTUs" and "Least frequent OTUs": mainly "Incertae sedis", and thus not so informative.

  1. Press Proceed.

Add metadata (step 5)

In this step, dataset information is added, including dataset title, description, persons and affiliations, etc.

Important
For real-world datasets, it is highly recommended to provide comprehensive and detailed metadata. This enhances discoverability, promotes reusability, and helps data re-users better evaluate the dataset’s suitability for their specific purposes.
Tip
Toggle "Show help" to get guidance text for the fields.
advanced metadata
Figure 4. Edit metadata. In this section you provide dataset information/metadata in defined sections (left panel): Basic Metadata, Geographic Coverage, Taxonomic Coverage, etc.

The dataset information/metadata is grouped in sections: Basic Metadata, Geographic Coverage, Taxonomic Coverage, etc.

Example dataset
Explore the metadata sections to get familiar with options. When working with a real dataset, please refer to the full guidance in this section (not written) and to the section on [preparation_structure].

Minimum actions to continue: select a license and add a contact (minimum: email).

  1. Press Proceed.

Export (step 6)

At this step, the so-called [dwc-a] is produced. The dataset can be published directly to the GBIF TEST ENVIRONMENT (UAT) in this step. This is a way for users to preview a potential real GBIF.org publication.

  1. Press Create DWC archive.

    This process generates the [dwc-a] from the data, progressing through several steps, each marked with a green check if successful.

  2. Press Publish to GBIF test environment (UAT).

    A notification will inform you that data ingestion may take a few minutes before all samples are visible in the GBIF test environment (UAT). A link to the dataset in the test environment will appear next to the "Publish" button.

  3. Click on the link Dataset at gbif-uat.org.

  4. Explore the dataset in the GBIF test environment (UAT).

  5. Ensure that all information and data is processed and displayed appropriately as you expect.

advanced uat
Figure 5. Pressing Dataset at gbif-uat.org opens the dataset (here the example dataset) in the GBIF test environment (UAT) where users can preview a potential real publication, and verify that the processed dataset contains all the wanted information for real publication. NB: The dataset is not completely indexed immediately and the dataset may e.g. have 0 occurrences and no map compared to this figure, until fully indexed. Notice how the hatched/shaded header and the red "TEST" label indicate that it is a test environment. Explore the dataset and notice how the uploaded data and dataset information/description is presented on the website.
  1. Go back to the MDT.

  2. Press on Publish (directly in the header with the 7 steps).

Publish (step 7)

In this step it is possible to do the necessary steps for formally publishing the processed dataset to GBIF.

Important
Users still learning/testing with example datasets should not actively use any options from this step.

MDTs can either be in publishing mode or conversion-only mode. Go to the relevant section based on what MDT you are using:

dg publish publmode
Figure 6. MDTs in publishing mode.
dg publish demo2
Figure 7. MDTs in conversion-only mode

MDT in publishing mode

MDTs in publishing mode can publish directly to GBIF.

When publishing their first dataset, users will not yet have been associated with a publishing organization. In order to publish a dataset to GBIF, the user’s institution/organisation must be registered as a data publisher in GBIF.

dg publish find institution
Figure 8. To publish the first dataset on GBIF, a user’s institution must be registered as a data publisher, and the user must be associated with the institution in the MDT.

Find/register your institution using the search box to search for your institution.

  • If your institution is already registered:

    • Select it, and click on "Ask for access to publish under this institution/organisation". This will start a preformulated email to the manager of the MDT, asking to associate you with the institution in the MDT.

    • Send the mail, and allow some time for the MDT manager to get back to you.

  • If you can't find your institution/organisation:

    • Click on "Ask for help with registering your institution/organisation". This will start a preformulated email to the manager of the MDT, asking for help with the steps needed in order to get your institution recognized as a GBIF data publisher and to associate you so your dataset can be published.

    • Add the relevant information about your institution in the mail:

      • INSTITUTION NAME

      • INSTITUTION ADDRESS

      • CONTACT EMAIL

    • Send the mail, and allow some time for the MDT manager to get back to you.

dg publish find institution step2
Figure 9. If your institution is already registered, select it and request access to publish. If not, ask for help with registering.

When associated with a publishing organization in the MDT, the user can publish datasets to GBIF.org:

  1. Select the correct publishing institution if associated with more than one.

  2. Confirm in the checkbox that you have read and understood the data sharing agreement.

  3. If relevant (users will usually know), select the Network that the dataset should be associated with. NB: selecting "OBIS" will notify OBIS that the dataset is available so they can index it. (is that how this works?)

  4. Press Publish to gbif.org

  5. That’s it.

dg publish publish
Figure 10. Users that have been associated with one or more publishing organizations by the MDT administrator can publish datasets to GBIF. Confirm the data sharing agreement, select a Network if relevant, and publish.

MDT in conversion-only mode

When using a hosted MDT in conversion mode or the GBIF Conversion MDT, the publishing step will require the user to contact the administrator of the MDT by using the link "Ready to publish a dataset? Reach out to the administrator for assistance".

Note
Users who have prepared datasets using the GBIF Conversion MDT should investigate whether there is a relevant hosted MDT administered by a relevant GBIF node. how can we guide here?
dg publish demo2
Figure 11. Users of MDT installations in conversion-only mode should use the link "Ready to publish a dataset? Reach out to the administrator for assistance". This will start a preformulated email to the manager of the MDT, requesting help.

Publish through IPT

This publishing procedure may be relevant for:

  • Data publishers that cannot have their data in a hosted repository.

  • Managers of a MDT in conversion-only mode.

The Integrated Publishing Toolkit — commonly referred to as the IPT — is a free open-source software developed by GBIF and used by organizations around the world to create and manage repositories for sharing biodiversity datasets. If you have access to an IPT and know how to use it, you can download the [dwc-a] produced by the MDT at the Export (step 6) and publish it through an IPT.

NOTE: By downloading a dataset from the MDT and publishing it elsewhere, the possibility for easy updating, re-processing and visualization of the dataset in the MDT is lost.

The MDT produces a fully publishable [dwc-a] with no need for changes or additions. The archive can validate in the GBIF data validator.

Users may run into challenges if using older versions of the IPT and/or if the DNA-derived data extension has not been installed. There is also a known issue that requires the values of the license fields to be set manually.

Publishing an archive from the MDT via IPT

  • Download the DwC-A (archive.zip) from the MDT.

  • Log in to the IPT.

  • Press Manage Resources.

  • Press Create new.

  • Give your dataset a Shortname.

  • Select Occurrence under Type.

  • Choose Import from an archived resource and select the archive on the computer.

  • Press Create.

  • Validate and verify that the data looks as expected.

  • Publish the data as one would normally do in the IPT.
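Before importing the downloaded archive into the IPT, it can be useful to confirm that the zip file contains the descriptor and metadata files a Darwin Core Archive typically carries. The sketch below assumes these are named meta.xml and eml.xml, as is conventional. The function name is hypothetical, and the demo builds a small in-memory zip instead of using a real MDT download.

```python
import io
import zipfile

# Conventional file names in a Darwin Core Archive (assumed here).
REQUIRED_FILES = ("meta.xml", "eml.xml")


def missing_archive_files(archive_file, required=REQUIRED_FILES):
    """Return the required file names that are absent from the zip.

    archive_file may be a path (e.g. "archive.zip") or a file-like object.
    """
    with zipfile.ZipFile(archive_file) as archive:
        names = set(archive.namelist())
    return [name for name in required if name not in names]


# Demo on an in-memory zip so the sketch runs without a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("meta.xml", "<archive/>")
    demo.writestr("occurrence.txt", "occurrenceID\n")
buffer.seek(0)

missing = missing_archive_files(buffer)
print(missing)
```

For a real download, pass the path to the file instead, e.g. `missing_archive_files("archive.zip")`. An empty list means both files were found.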

Register and host DwC-A elsewhere

A Darwin Core Archive produced with the MDT may be put elsewhere on the web – preferably in a stable repository (e.g. Zenodo, GitHub) – and can then be indexed by GBIF. This requires somebody to register the new resource with GBIF.

  • Download the DwC-A (archive.zip) from the MDT.

  • Put the archive in a stable repository so you have a URL: www/xxx/archive.zip

  • Register the dataset with the relevant publisher in the GBIF registry (How is that done ?).

    • See e.g. this video on sharing to GBIF via APIs.

    • See this blog post on general possibilities to publish and host datasets.