The coded workflow corresponding to the data descriptor: Madin et al. (XXXX) A global synthesis of bacterial and archaeal phenotypic trait and environment data. Scientific Data XXXX
- amend-shock.csv
- bacdive-microa.csv
- bergeys.csv
- campedelli.csv
- corkrey.csv
- edwards.csv
- engqvist.csv
- faprotax.csv
- fierer.csv
- genbank.csv
- gold.csv
- jemma-refseq.csv
- kegg.csv
- kremer.csv
- masonmm.csv
- mediadb.csv
- methanogen.csv
- microbe-directory.csv
- nielsensl.csv
- pasteur.csv
- patric.csv
- prochlorococcus.csv
- protraits.csv
- roden-jin.csv
- rrndb.csv
- silva.csv
Directory / File | Content
---|---
data/ | Contains all the raw data. This directory is read only (i.e., do not change these files!).
data/conversion_tables/ | Contains tables for mapping traits among datasets to ensure standard naming and values. This directory also contains the data_corrections file for changes to records with errors.
data/raw/ | Contains the raw data file for each of the datasets in the merger. Each raw data sub-directory contains a README file that explains where the dataset came from, with notes about how it was processed.
data/taxonomy/ | Contains the NCBI taxonomy database dump.
R/ | All R code for preparing and merging/condensing the final datasets, including files containing global functions, loading necessary R packages, and global settings (outlined below). Data preparation files are in the preparation directory. The main merge files are condense_species.R and condense_traits.R.
R/preparation | Contains the individual dataset preparation scripts.
R/taxonomy | Contains the NCBI taxonomy and taxonomic mapping preparation scripts.
R/web_extraction | Contains scripts that use APIs to extract data from web-based datasets.
output/ | For all R code output that is used for analysis or during the merge and preparation process. Everything in this folder can be regenerated by the R code.
output/prepared_data/ | Contains the processed, cleaned-up version of all included data frames. These are generated by the individual preparation scripts for each dataset and are loaded into the merger code.
output/prepared_references/ | Contains the processed reference table.
output/taxonomy/ | Contains pre-processed taxonomy maps and the main NCBI mapping table with other taxonomic levels.
.gitignore | What should be ignored by git versioning; these files also won't appear on GitHub. This currently includes important files that are too large for GitHub, such as the NCBI taxonomy and the Bergey's PDFs.
README.md | This is what you're reading. There are other README files throughout the file system, including in the data directory with information about datasets.
workflow.R | The final analysis code that loads and prepares each raw dataset and then runs the merge and condensations.
bacteria_archaea_traits.Rproj | Double-click this file after cloning the GitHub project to open it in RStudio. Otherwise, this file can be ignored.
There are two datasets that are too large to be kept on GitHub; these are automatically downloaded from Figshare when workflow.R is first run. If you experience problems, these files need to be downloaded manually from their sources and placed in the project as outlined here.
File | Where to access | Where to place
---|---|---
genome_metadata.txt | ftp://ftp.patricbrc.org/RELEASE_NOTES/ | data/raw/patric/
taxonomy_names.csv | https://figshare.com/s/ab40d2a35266d729698c | output/taxonomy/
The settings.R file in the R directory is where the different scripts are told which datasets to include in the merger, how to handle different traits, and whether they need to be translated. This file must be updated whenever new traits are added to the merger, by inserting the respective column name into the appropriate vector variable (see description below).
Any variable that is required in the preparation code is named in capitals and prefixed with "CONSTANT_" to avoid being accidentally overwritten.
Variable | Description
---|---
CONSTANT_PREPARE_FILE_PATH | Path to the folder containing all scripts for preparing raw datasets
CONSTANT_DATA_PATH | Path to the folder containing all prepared data for merging
CONSTANT_PREPARE_DATASETS | Vector of the names of datasets that need to be prepared using dedicated scripts
CONSTANT_EXCLUDED_DATASETS | Vector of the names of datasets that should NOT be included in the following merger
CONSTANT_CATEGORICAL_DATA_COLUMNS | Vector with the names of columns holding categorical trait data
CONSTANT_CONTINUOUS_DATA_COLUMNS | Vector with the names of columns holding continuous (integer/double/float) trait data
CONSTANT_OTHER_COLUMNS | Vector with the names of columns holding non-data information such as phylogenetic categories
CONSTANT_ALL_DATA_COLUMNS | Vector holding all data columns
CONSTANT_FINAL_COLUMNS | Vector holding all columns to be included in the final output
CONSTANT_DATA_FOR_RENAMING | Vector holding the names of columns whose values need to be renamed according to their respective lookup tables
CONSTANT_DATA_COMMA_CONCATENATED | Vector holding the names of traits that should be combined into a concatenated string for each species (may be comma concatenated on input as well)
CONSTANT_SPECIAL_CATEGORICAL_TRAITS | Vector holding the column names of traits that should NOT be condensed using the standard condensation code
CONSTANT_GENERAL_CATEGORICAL_PROCESSING | Vector holding the column names of traits that SHOULD be condensed using the standard condensation code
CONSTANT_DOMINANT_TRAIT_PROPORTION | The proportion (out of 100) of records that a trait value must account for in order to be selected as the representative value for that species
CONSTANT_FILL_GTDB_WITH_NCBI | TRUE/FALSE: whether the NCBI phylogeny should be used to fill gaps in the GTDB phylogeny
CONSTANT_DOMINANT_TRAIT_PRIORITISE | Sets whether the most ("max") or least ("min") stringent value (e.g., "obligate aerobe" is more stringent than "aerobe") should be chosen for a given trait in case of a tie between two values
CONSTANT_GROWTH_RATE_ADJUSTMENT_FINAL_TMP | Temperature for calculating temperature-standardised growth rates using Q10
CONSTANT_GROWTH_RATE_ADJUSTMENT_Q10 | Q10 value for standardised growth rate calculations
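For orientation, these settings might be defined along the lines of the sketch below. This is illustrative only: the values and vector contents shown are assumptions, and the authoritative definitions live in R/settings.R.

```r
# Illustrative sketch only: values are assumptions, see R/settings.R for the real definitions.
CONSTANT_PREPARE_FILE_PATH <- "R/preparation/"
CONSTANT_DATA_PATH <- "output/prepared_data/"

CONSTANT_CATEGORICAL_DATA_COLUMNS <- c("metabolism", "sporulation", "motility", "cell_shape")
CONSTANT_CONTINUOUS_DATA_COLUMNS  <- c("d1_lo", "d1_up", "d2_lo", "d2_up", "doubling_h",
                                       "genome_size", "gc_content", "optimum_tmp", "optimum_ph")

# A trait value must account for at least this share (out of 100) of a species'
# records to be selected as the representative value.
CONSTANT_DOMINANT_TRAIT_PROPORTION <- 50
```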
One setting is kept in workflow.R: CONSTANT_BASE_PHYLOGENY selects the taxonomy to be used for condensation ("NCBI" for the NCBI taxonomy or "GTDB" for the Genome Taxonomy Database).
The condensed_traits output contains all of the datasets combined and condensed into one row per trait record, joined with the full phylogeny based on the NCBI taxonomy map. The condensed_species output is produced from condensed_traits by condensing the data further to one row per species, as per the NCBI taxonomy.
- condensed_species_NCBI.csv
- condensed_traits_NCBI.csv
- condensed_species_GTDB.csv
- condensed_traits_GTDB.csv
R version > 3.5 is required.
The workflow is outlined in workflow.R. If you want to run the full merger, including the preparation of each data source, you only need to run the workflow.R script:

```r
source("workflow.R")
```

If you have updated a preparation script for an individual dataset or added a new dataset, you only need to run that particular preparation script (e.g., for corkrey):

```r
source("R/preparation/corkrey.R")
```

Then run the merger for all prepared data by:

- Indicating the phylogeny to use (CONSTANT_BASE_PHYLOGENY = "NCBI" or "GTDB")
- Running:

```r
source("R/condense_traits.R")
source("R/condense_species.R")
```

See workflow.R for more detail; a combined example is sketched below.
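Putting the above together, a partial re-run might look like the following sketch. It assumes the project's settings and global functions are already loaded in the session (a full run of workflow.R handles this), so treat it as an outline rather than a standalone script.

```r
# Minimal sketch of a partial re-run after editing one dataset's preparation
# script. Assumes the session already has the project's settings and helper
# functions loaded (e.g., via an earlier source("workflow.R")).
CONSTANT_BASE_PHYLOGENY <- "NCBI"   # or "GTDB"
source("R/preparation/corkrey.R")   # re-prepare only the updated dataset
source("R/condense_traits.R")       # merge prepared datasets: one row per trait record
source("R/condense_species.R")      # condense further: one row per species
```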
- Create a folder in data/raw/ with the name you want to identify the dataset by in the merger (e.g., data/raw/newdata/).
- Place the new data file in this folder. Create a README.md file with details of the data source's origin and how the data will be processed.
- In the R/preparation/ folder, create a new R script with the exact same name as the dataset (e.g., newdata.R). This file should contain all code necessary to do the following (a sketch of such a script is shown below):
  - Load the new dataset from data/raw/newdata/ (assuming the newdata naming example above)
  - Clean up the data, ensuring data formats are correct for each data type (e.g., fix Excel conversion of numbers to dates, deal with invalid characters, etc.)
  - Rename specific columns to their set column names based on data type (see section "column names and descriptions")
  - Save the prepared dataset as a .csv file in output/prepared_data/; for our example case this would be newdata.csv
- Add the new dataset name to the vector named "CONSTANT_PREPARE_DATASETS" located in R/settings.R. This ensures the data preparation file is run when running the full workflow.
If the new dataset only includes data columns with names that are already present in the merger, nothing more is required: simply re-run workflow.R and the new data are included in the final outputs. If the new dataset contains new data types/columns, see the section "Adding a new data type".
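A minimal sketch of what such a preparation script could look like, assuming a hypothetical newdata dataset with a file data/raw/newdata/newdata.csv; the input column names below are made up, and the real preparation scripts in R/preparation/ may be structured differently and use shared helper functions.

```r
# R/preparation/newdata.R -- illustrative sketch only.

# Load the raw data (hypothetical file and column names)
raw <- read.csv("data/raw/newdata/newdata.csv", stringsAsFactors = FALSE)

# Rename input columns to the standard column names used in the merger
names(raw)[names(raw) == "organism_name"]   <- "species"      # made-up input column
names(raw)[names(raw) == "opt_temperature"] <- "optimum_tmp"  # made-up input column

# Clean up: coerce types, strip invalid characters, etc.
raw$optimum_tmp <- as.numeric(raw$optimum_tmp)

# Tag the source and save to the prepared data folder read by the merger
raw$data_source <- "newdata"
write.csv(raw, "output/prepared_data/newdata.csv", row.names = FALSE)
```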
When adding a new type of data to the merger (i.e., a new data column), it is necessary to tell the code how to process this data. This is done in the R/settings.R file as follows (a sketch of the corresponding edits is shown after the list).

Note: all vectors referred to below are located in R/settings.R.

1. For each new data type, determine whether the data is categorical (i.e., named values) or continuous (numbers), and add the name of the respective column to the appropriate vector, CONSTANT_CATEGORICAL_DATA_COLUMNS or CONSTANT_CONTINUOUS_DATA_COLUMNS.
2. If the new data require translation to a common terminology, a renaming table named "renaming_[new column name].csv" must be added to the data/conversion_tables/ folder, and the column name must be added to the CONSTANT_DATA_FOR_RENAMING vector. See section "Creating a conversion table" below for details. Some example columns are "metabolism", "sporulation" and "motility".
3. If the new data are comma delimited on input (i.e., a data point takes the form "x, y, z, ...") AND/OR if the data should be combined into comma-delimited strings during species condensation, the name of the data column must be added to the vector CONSTANT_DATA_COMMA_CONCATENATED. These data are NOT processed in any way other than ensuring that only unique values are included in each string on output (i.e., inputs #1 = "x, y, z" and #2 = "y, u" for a given species are combined in the output to "x, y, z, u"). Current example columns are "pathways" and "carbon_substrates".
4. If the new data are categorical but NOT of a general sort (or no grouping and priority list exists in the name conversion table), and therefore should NOT be processed using the general condensation scripts for categorical data, the column name must be included in the vector CONSTANT_SPECIAL_CATEGORICAL_TRAITS. This is also the case if the data are comma delimited (see #3). For special data columns, code for processing that particular column should be added to the species condensation file where required.
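As a sketch of the settings.R edits, consider adding a hypothetical new categorical column named "salinity_preference"; the column name is invented for illustration and the existing vector contents shown are examples only.

```r
# In R/settings.R -- illustrative sketch only; existing entries shown are examples.
# Hypothetical new categorical column "salinity_preference":
CONSTANT_CATEGORICAL_DATA_COLUMNS <- c("metabolism", "sporulation", "motility",  # existing columns
                                       "salinity_preference")                    # new column

# If its values need translation via data/conversion_tables/renaming_salinity_preference.csv:
CONSTANT_DATA_FOR_RENAMING <- c("metabolism", "sporulation", "motility",
                                "salinity_preference")
```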
If categorical data values require translation to a uniform terminology (almost always the case when combining multiple datasets covering the same categorical data), it is necessary to create a conversion table containing the original and new terms for each possible input term from any dataset.

- These tables must be named "renaming_[new column name].csv".
- The table must contain the columns "Original" and "New", containing the original term and the term it should be translated to, respectively.
- If the respective categorical data are to be processed using the general species condensation script, which determines which categorical value to choose from many options (using dominance and term priorities - see examples "metabolism", "sporulation", "motility" and others), the translation table must contain the two additional columns "Priority" and "Category".
  - "Category": this column contains numeric ids that group different but related terms. For instance, the metabolisms "aerobic" and "obligate aerobic" are grouped into category #1, whereas the terms "anaerobic" and "obligate anaerobic" are grouped into category #2.
  - "Priority": this column contains numeric ids that indicate the specificity or value of a term over another WITHIN each term category. For instance, the terms within category #1 above have priorities "aerobic" = 1 and "obligate aerobic" = 2 (higher values indicate higher priority), because the term "obligate aerobic" should be chosen over "aerobic" when the two terms are tied in occurrence in the species condensation.
- tax_id: The NCBI taxonomy id at the lowest phylogenetic level identified by the species name
- species_tax_id: The NCBI taxonomy id at species level
- species: Species name of the organism (this and below: NCBI or GTDB naming, depending on the phylogeny chosen)
- genus: Genus of the organism
- family: Family of the organism
- order: Order of the organism
- class: Class of the organism
- phylum: Phylum of the organism
- superkingdom: Kingdom of the organism
- gram_stain: Gram reaction of the organism (+/-)
- metabolism: Oxygen use (aerobic, anaerobic, etc.)
- pathways: List of metabolic pathways the organism can carry out (e.g., nitrate reduction, sulfur oxidation)
- carbon_substrates: List of carbon substrates the organism can utilise
- sporulation: Whether the organism sporulates (yes/no)
- motility: Whether the organism is motile, and how
- range_tmp: Temperature range reported for the organism
- range_salinity: Salinity range reported for the organism
- cell_shape: Cell shape of the organism
- isolation_source: Isolation sources reported for the organism
- d1_lo: Lowest diameter
- d1_up: Largest diameter
- d2_lo: Smallest length
- d2_up: Largest length
- doubling_h: Minimum doubling time in hours
- genome_size: Genome size of the organism
- gc_content: GC content of the organism (ratio)
- coding_genes: Number of coding genes
- optimum_tmp: Optimum temperature
- optimum_ph: Optimum pH
- growth_tmp: Reported growth temperature (not necessarily optimal)
- rRNA16S_genes: Number of 16S rRNA genes
- tRNA_genes: Number of tRNA genes
- data_source: List of data sources from which the information for the specific organism was obtained
- ref_id: List of reference ids to the original literature (where available) from which the data were obtained
- intracellular: Whether the organism has an intracellular lifestyle (0/1)
- phototroph: Whether the organism is phototrophic (0/1)
- [col name].count: Columns ending in ".count" indicate the number of data points condensed for the specific column value
- [col name].prop: Columns ending in ".prop" indicate the proportion of all condensed data points that agree with the chosen value
- [col name].stdev: Columns ending in ".stdev" indicate the standard deviation of the condensed data points