Bacteria and archaea phenotypic traits

This repository contains the coded workflow for the data descriptor: Madin et al. (XXXX) A global synthesis of bacterial and archaeal phenotypic trait and environment data. Scientific Data XXXX

Current datasets included in merger workflow

  1. amend-shock.csv
  2. bacdive-microa.csv
  3. bergeys.csv
  4. campedelli.csv
  5. corkrey.csv
  6. edwards.csv
  7. engqvist.csv
  8. faprotax.csv
  9. fierer.csv
  10. genbank.csv
  11. gold.csv
  12. jemma-refseq.csv
  13. kegg.csv
  14. kremer.csv
  15. masonmm.csv
  16. mediadb.csv
  17. methanogen.csv
  18. microbe-directory.csv
  19. nielsensl.csv
  20. pasteur.csv
  21. patric.csv
  22. prochlorococcus.csv
  23. protraits.csv
  24. roden-jin.csv
  25. rrndb.csv
  26. silva.csv

Overview of project files

| Directory | Content |
|---|---|
| data/ | Contains all the raw data. This directory is read only (i.e., do not change these files!) |
| data/conversion_tables/ | Contains tables for mapping traits among datasets to ensure standard naming and values. This directory also contains the data_corrections file for changes to records with errors. |
| data/raw/ | Contains the raw data file for each of the datasets in the merger. Each raw data sub-directory contains a README file that explains where the dataset came from and how it was processed. |
| data/taxonomy/ | Contains the NCBI taxonomy database dump. |
| R/ | All R code for preparing and merging/condensing the final datasets, including files containing global functions, loading necessary R packages, and global settings (outlined below). Data preparation files are in the preparation directory. The main merge files are condense_species.R and condense_traits.R. |
| R/preparation | Contains the individual dataset preparation scripts. |
| R/taxonomy | Contains the NCBI taxonomy and taxonomic mapping preparation scripts. |
| R/web_extraction | Contains scripts that use APIs to extract data from web-based datasets. |
| output/ | For all R code output that is used for analysis or during the merge and preparation process. Everything in this folder can be regenerated by the R code. |
| output/prepared_data/ | Contains the processed, cleaned-up version of all included data frames. These are generated by the individual preparation scripts for each dataset and are loaded into the merger code. |
| output/prepared_references/ | Contains the processed reference table. |
| output/taxonomy/ | Contains pre-processed taxonomy maps and the main NCBI mapping table with other taxonomic levels. |
| .gitignore | Lists what git versioning should ignore; these files also won't appear on GitHub. This currently includes important files that are too large for GitHub, like the NCBI taxonomy and the Bergey's PDFs. |
| README.md | This is what you're reading. There are other README files throughout the file system, including in the data directory with information about datasets, etc. |
| workflow.R | The final analysis code that loads and prepares each raw dataset and then runs the merge and condensations. |
| bacteria_archaea_traits.Rproj | Double-click this file after cloning the GitHub project to your computer to open the project in RStudio. Otherwise, this file can be ignored. |

Large datasets not included in GitHub

Two datasets are too large to be kept on GitHub; they are downloaded automatically from Figshare when workflow.R is first run. If you experience problems, download these files manually from their sources and place them in the project directory as outlined here.

| File | Where to access | Where to place |
|---|---|---|
| genome_metadata.txt | ftp://ftp.patricbrc.org/RELEASE_NOTES/ | data/raw/patric/ |
| taxonomy_names.csv | https://figshare.com/s/ab40d2a35266d729698c | output/taxonomy/ |
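If the automatic download fails, it can help to confirm which of the two files is actually missing before fetching anything by hand. The following is a minimal sketch, not part of the repository: `check_large_files()` is a hypothetical helper, and it only assumes the two target locations listed in the table above.

```r
# Hypothetical helper (not part of the repository): returns the relative paths
# of the large files that are missing from their expected locations.
check_large_files <- function(project_root = ".") {
  expected <- c(
    file.path("data", "raw", "patric", "genome_metadata.txt"),
    file.path("output", "taxonomy", "taxonomy_names.csv")
  )
  expected[!file.exists(file.path(project_root, expected))]
}

# Run from the project root; prints a message only if something is missing.
missing_files <- check_large_files(".")
if (length(missing_files) > 0) {
  message("Missing large files: ", paste(missing_files, collapse = ", "))
}
```

Any path reported here needs to be downloaded from the source given in the table and placed in the listed directory.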

Overview of settings file

The settings.R file in the R directory is where the different scripts are told which data sets to include in the merger and how to handle different traits and whether they will need to be translated. This file will need to be updated if new traits are added to the merger by inserting the respective column name into the appropriate vector variable (see description below).

Any variable that is required in the preparation code is named in capitals and prefixed with "CONSTANT_" to avoid being accidentally overwritten.

| Variable | Description |
|---|---|
| CONSTANT_PREPARE_FILE_PATH | Path to the folder containing all scripts for preparing raw data sets |
| CONSTANT_DATA_PATH | Path to the folder containing all prepared data for merging |
| CONSTANT_PREPARE_DATASETS | Vector of the names of data sets that need to be prepared using dedicated scripts |
| CONSTANT_EXCLUDED_DATASETS | Vector of the names of data sets that should NOT be included in the following merger |
| CONSTANT_CATEGORICAL_DATA_COLUMNS | Vector with the names of columns holding categorical trait data |
| CONSTANT_CONTINOUS_DATA_COLUMNS | Vector with the names of columns holding continuous (integers/doubles/floats) trait data |
| CONSTANT_OTHER_COLUMNS | Vector with the names of columns holding non-data information such as phylogenetic categories |
| CONSTANT_ALL_DATA_COLUMNS | Vector holding all data columns |
| CONSTANT_FINAL_COLUMNS | Vector holding all columns to be included in the final output |
| CONSTANT_DATA_FOR_RENAMING | Vector holding the names of columns whose data need to be renamed according to the respective lookup tables |
| CONSTANT_DATA_COMMA_CONCATENATED | Vector holding the names of traits that should be combined into a concatenated string for each species (may be concatenated on input as well) |
| CONSTANT_SPECIAL_CATEGORICAL_TRAITS | Vector holding the column names of traits that should NOT be condensed using the standard condensation code |
| CONSTANT_GENERAL_CATEGORICAL_PROCESSING | Vector holding the column names of traits that SHOULD be condensed using the standard condensation code |
| CONSTANT_DOMINANT_TRAIT_PROPORTION | The proportion (out of 100) of a species' data points that a trait value must account for in order to be selected as the representative value for that species |
| CONSTANT_FILL_GTDB_WITH_NCBI | TRUE/FALSE: whether the NCBI phylogeny should be used to fill gaps in the GTDB phylogeny |
| CONSTANT_DOMINANT_TRAIT_PRIORITISE | Sets whether the most ("max") or least ("min") stringent value for a given trait should be chosen in case of a tie between two values (e.g., "obligate aerobe" is more stringent than "aerobe") |
| CONSTANT_GROWTH_RATE_ADJUSTMENT_FINAL_TMP | Temperature for calculating temperature-standardised growth rates using Q10 |
| CONSTANT_GROWTH_RATE_ADJUSTMENT_Q10 | Q10 value for standardised growth rate calculations |
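To illustrate what the two growth rate settings control, here is a minimal sketch of the textbook Q10 temperature correction. It is an assumption that the workflow uses exactly this form; the function name `adjust_rate_q10()` and the numbers are hypothetical and are not taken from R/settings.R.

```r
# Textbook Q10 correction (assumed form, not copied from the repository):
# a rate measured at measured_tmp is rescaled to final_tmp, changing by a
# factor of q10 for every 10 degree difference.
adjust_rate_q10 <- function(rate, measured_tmp, final_tmp, q10) {
  rate * q10^((final_tmp - measured_tmp) / 10)
}

# Illustrative values only: a rate of 1 measured at 20 degrees, standardised
# to 30 degrees with Q10 = 2, doubles.
adjust_rate_q10(rate = 1, measured_tmp = 20, final_tmp = 30, q10 = 2)
# 2
```

Because a doubling time is inversely related to a growth rate, the same factor would divide rather than multiply a doubling time.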

Merged and condensed files

One setting is kept in workflow.R: CONSTANT_BASE_PHYLOGENY selects the taxonomy used for condensation ("NCBI" or "GTDB", the Genome Taxonomy Database).

The condensed_traits file contains all the datasets combined and condensed into one row per trait, together with the full phylogeny based on the NCBI taxonomy map.

The condensed_species file is produced from condensed_traits by condensing the data to one row per species according to the NCBI taxonomy.

Run scripts

R version [>3.5]

The workflow is outlined in workflow.R. If you want to run the full merger including the preparation of each data source, then you only need to run the workflow.R script:

source("workflow.R")

If you have updated a preparation script for an individual dataset or added a new dataset, you only need to run the particular preparation script (e.g., for corkrey):

source("R/preparation/corkrey.R")

Then run the merger for all prepared data by:

  1. Indicating the phylogeny to use (CONSTANT_BASE_PHYLOGENY = "NCBI" or "GTDB")
  2. source("R/condense_traits.R")
  3. source("R/condense_species.R")

See workflow.R for more detail.

Adding a new data source

  1. Create a folder in data/raw/ with the name you want to identify the dataset by in the merger (e.g., data/raw/newdata/).
  2. Place the new data file in this folder. Create a README.md file with details of the data source's origin and how the data will be processed.
  3. In the R/preparation/ folder, create a new R script with exactly the same name as the data set (e.g., newdata.R). This file should contain all code necessary to do the following:
     • Load the new data set from data/raw/newdata/ (assuming the newdata naming example above)
     • Clean up the data, ensuring data formats are correct for each data type (e.g., fix Excel conversion of numbers to dates, deal with invalid characters, etc.)
     • Rename specific columns to their set column name based on data type (see section "Column names and descriptions in species condensed data set")
     • Save the prepared dataset as a .csv file in output/prepared_data/, so for our example case this would be: newdata.csv
  4. Add the new data set name to the vector named "CONSTANT_PREPARE_DATASETS" located in R/settings.R. This ensures the data preparation file is run when running the full workflow.

If the new data set only includes data columns with names that are already present in the merger, nothing more is required. Simply re-run the workflow.R and the new data is included in the final outputs. If the new data set contains new data types / columns, see section "Adding a new data type".
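The preparation steps above can be sketched as follows. This is a minimal, self-contained illustration rather than one of the repository's scripts: it builds a toy project in a temporary directory, the raw file name, its columns, and its values are invented, and the prepared file is written to output/prepared_data/ as described in the directory overview.

```r
# Toy project root in a temporary directory so the sketch runs anywhere.
root <- tempfile("traits_project")
raw_dir <- file.path(root, "data", "raw", "newdata")
out_dir <- file.path(root, "output", "prepared_data")
dir.create(raw_dir, recursive = TRUE)
dir.create(out_dir, recursive = TRUE)

# Stand-in raw file (invented columns and values, for illustration only).
writeLines(
  c("Organism,Temp opt,Gram",
    "Escherichia coli,37,negative",
    "Bacillus subtilis, 30 ,positive"),
  file.path(raw_dir, "newdata_raw.csv")
)

# 1. Load the raw data as text so that cleaning is explicit.
raw <- read.csv(file.path(raw_dir, "newdata_raw.csv"),
                colClasses = "character", check.names = FALSE)

# 2. Clean up: trim stray whitespace, then convert numeric columns.
raw[] <- lapply(raw, trimws)
raw[["Temp opt"]] <- as.numeric(raw[["Temp opt"]])

# 3. Rename columns to the names used in the merger.
column_map <- c("Organism" = "species",
                "Temp opt" = "optimum_tmp",
                "Gram"     = "gram_stain")
names(raw) <- unname(column_map[names(raw)])

# 4. Save the prepared data set under the data set's name.
write.csv(raw, file.path(out_dir, "newdata.csv"), row.names = FALSE)
```

In a real preparation script the temporary directory and the `writeLines()` call would be dropped, and the paths would point at the project's own data/raw/newdata/ and output/prepared_data/ folders.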

Adding a new data type

When adding a new type of data to the merger (i.e., a new data column), it is necessary to tell the code how to process this data. This is done in the R/settings.R file as follows:

Note: All vectors referred to below are located in R/settings.R

  1. For each new data type, determine whether the data is categorical (i.e., named values) or continuous (numbers), and add the name of the respective column to the appropriate vector CONSTANT_CATEGORICAL_DATA_COLUMNS or CONSTANT_CONTINUOUS_DATA_COLUMNS.
  2. If the new data require translation to a common terminology, a renaming table named "renaming_[new column name].csv" must be added to the data/conversion_tables/ folder, and the column name must be added to the CONSTANT_DATA_FOR_RENAMING vector. See section "Creating a conversion table" below for details. Some example columns are "metabolism", "sporulation", "motility".
  3. If the new data is comma delimited on input (i.e., a data point takes the form of "x, y, z, ...") AND/OR if the data should be combined as comma-delimited strings during species condensation, the name of the data column must be added to the vector CONSTANT_DATA_COMMA_CONCATENATED. This data will NOT be processed in any way other than ensuring that only unique values are included in each string on output (e.g., inputs #1 = "x, y, z" and #2 = "y, u" for a given species are combined in the output to "x, y, z, u"). Current example columns are "pathways" and "carbon_substrates".
  4. If the new data is categorical but NOT of a general sort (or no grouping and priority list exists in the name conversion table) and therefore should NOT be processed using the general condensation scripts for categorical data, the column name must be included in the vector CONSTANT_SPECIAL_CATEGORICAL_TRAITS. This is also the case if the data is comma delimited (see #3). For special data columns, code for processing this particular data column should be added to the species condensation file where required.
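The comma-concatenation behaviour described in step 3 can be illustrated with a short sketch. The function `combine_comma_values()` is hypothetical and is not the repository's implementation; it only reproduces the "unique values, in order of first appearance" rule from the example above.

```r
# Hypothetical helper: merge several comma-delimited strings for one species
# into a single string that lists each value once.
combine_comma_values <- function(values) {
  parts <- trimws(unlist(strsplit(values, ",", fixed = TRUE)))
  parts <- parts[!is.na(parts) & nzchar(parts)]  # drop missing and empty entries
  paste(unique(parts), collapse = ", ")
}

combine_comma_values(c("x, y, z", "y, u"))
# "x, y, z, u"
```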

Creating a conversion table

If categorical data values require translation to a uniform terminology (almost always the case when combining multiple data sets of the same categorical data), then it is necessary to create a conversion table containing the original and new terms for each possible input term from any data set.

  • These tables must be named "renaming_[new column name].csv".
  • The table must contain the columns "Original" and "New", containing the original term and the term it should be translated to, respectively.
  • If the respective categorical data is to be processed using the general species condensation script for determining which categorical value to choose from many options (using dominance and term priorities; see examples "metabolism", "sporulation", "motility" and others), the translation table must contain the two additional columns "Priority" and "Category".
    • "Category": This column contains numeric ids that group different but related terms. For instance, the metabolisms "aerobic" and "obligate aerobic" are grouped in category #1, whereas the terms "anaerobic" and "obligate anaerobic" are grouped in category #2.
    • "Priority": This column contains numeric ids that indicate the specificity or value of one term over another WITHIN each term category. For instance, the terms within category #1 above have priorities "aerobic" = 1 and "obligate aerobic" = 2 (higher values indicate higher priority) because the term "obligate aerobic" should be chosen over "aerobic" when the two terms occur equally often in the species condensation.
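As an illustration of how such a table could be used, the sketch below builds a small conversion table with the four columns described above and applies a simplified selection rule. The table contents, the "Original" terms, and the function `choose_term()` are hypothetical; the rule shown (most frequent term wins, ties broken by the highest "Priority") is an assumption that ignores the "Category" grouping and the dominance threshold, so it should not be read as the repository's condensation code.

```r
# Hypothetical conversion table with the columns described above.
renaming_metabolism <- data.frame(
  Original = c("aerobe", "strictly aerobic", "anaerobe", "obligately anaerobic"),
  New      = c("aerobic", "obligate aerobic", "anaerobic", "obligate anaerobic"),
  Category = c(1, 1, 2, 2),
  Priority = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Simplified selection: translate the observed terms, pick the most frequent
# translated term, and break ties using the highest Priority value.
choose_term <- function(observed, conversion) {
  translated <- conversion$New[match(observed, conversion$Original)]
  translated <- translated[!is.na(translated)]
  if (length(translated) == 0) return(NA_character_)
  counts <- table(translated)
  top <- names(counts)[counts == max(counts)]
  if (length(top) > 1) {
    priorities <- conversion$Priority[match(top, conversion$New)]
    top <- top[which.max(priorities)]
  }
  top
}

# One record of each term: the tie is resolved in favour of the higher priority.
choose_term(c("aerobe", "strictly aerobic"), renaming_metabolism)
# "obligate aerobic"
```

On disk, a table like `renaming_metabolism` would be saved as renaming_metabolism.csv in data/conversion_tables/, following the naming rule above.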

Column names and descriptions in species condensed data set

  • tax_id: The NCBI taxonomy id at the lowest phylogenetic level identified by the species name
  • species_tax_id: The NCBI taxonomy id at species level
  • species: Species name of organism (this and below: NCBI or GTDB naming depending on the phylogeny chosen)
  • genus: Genus of organism
  • family: Family of organism
  • order: Order of organism
  • class: Class of organism
  • phylum: Phylum of organism
  • superkingdom: Superkingdom of organism
  • gram_stain: Gram reaction of organism (+/-)
  • metabolism: Oxygen use (aerobic, anaerobic, etc.)
  • pathways: List of metabolic pathways the organism can carry out (e.g., nitrate reduction, sulfur oxidation)
  • carbon_substrates: List of carbon substrates the organism can utilise
  • sporulation: If the organism sporulates (yes/no)
  • motility: If the organism is motile and how
  • range_tmp: Temperature range reported for the organism
  • range_salinity: Salinity range reported for the organism
  • cell_shape: Cell shape of organism
  • isolation_source: Isolation sources reported for organism
  • d1_lo: Lowest diameter
  • d1_up: Largest diameter
  • d2_lo: Smallest length
  • d2_up: Largest length
  • doubling_h: Minimum doubling time in hours
  • genome_size: Genome size of organism
  • gc_content: GC content of organism (ratio)
  • coding_genes: Number of coding genes
  • optimum_tmp: Optimum temperature
  • optimum_ph: Optimum pH
  • growth_tmp: Reported growth temperature (not necessarily optimal)
  • rRNA16S_genes: Number of 16S rRNA genes
  • tRNA_genes: Number of tRNA genes
  • data_source: List of data sources from which the information for the specific organism was obtained
  • ref_id: List of reference ids to the original literature (where available) from which the data was obtained
  • intracellular: If the organism has an intracellular lifestyle (0/1)
  • phototroph: If the organism is phototrophic (0/1)
  • [col name].count: All columns ending in ".count" indicate the number of data points condensed for the specific column value
  • [col name].prop: All columns ending in ".prop" indicate the proportion of all condensed data points that agree with the chosen value
  • [col name].stdev: All columns ending in ".stdev" indicate the standard deviation of the condensed data points