A Reproducible Snakemake-based Workflow for Project "Studying Salmonella Gene Expression Dynamics in Response to Novobiocin"
Project was performed as a part of studies in Bioinformatics institute on program Bioinformatics for Biologists during spring semester 2022.
- Introduction
- Background
- Aim
- Project Objectives
- Data Description
- Snakemake Pipeline Description
- Installation of Snakemake
- Start running Snakemake pipeline
- Results
- Useful References
- Author and Acknowledgements
- Feedback contacts
In this project we analysed publicly available transcriptomic data obtained in time-series experiments on Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S treated with different concentrations of antibiotic novobiocin.
DNA superspiralization is an important mechanism of gene regulation in bacteria and superspiralization alteration, for instance, in response to antibiotics can change gene expression levels. Antibiotics of aminocoumarin class, like novobiocin, are able to change both superspiralization of genomic DNA and expression of genes responding to superspiralization levels. The object of the study is Salmonella enterica, bacterium largely resistant to novobiocin and DNA superspiralization alterations in general. Here we aim to analyze gene expression changes in Salmonella cultures grown on media with varying concentrations of antibiotics and to identify clusters of coexpressing genes.
To investigate effects of novobiocin on gene expression in Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S.
3. DESeq2 analysis of the count matrix;
4. weighted gene co-expression network analysis using the WGCNA;
5. Wopper based visualization of modules of interest based on results of WGCNA analysis;
6. develop a reproducible workflow for transcriptomic data analysis based on Snakemake pipeline.
Data is publicly available and can be found in the paper of Natalia Gogoleva et. al, 2020.
The provided dataset consists of RNA-seq reads obtained from samples of S. enterica cultures treated with novobiocin at concentrations of 100 and 500 μg per mL, and untreated cultures. These reads were already filtered using BBDuk (v. 37.23). In our project we performed analysis only on reads from untreated culture and with addition of 500 μg per mL of novobiocin at three different timepoints: 10, 20 and 60 min.
- FastQC analysis of samples (SnakeMake Wrapper v 1.5.0/bio/fastqc)
- Hisat2 Index (v 2.2.0)
- Hisat2 Align (Snakemake wrapper v 1.5.0/bio/hisat2/align)
- Samtools sort (Snakemake wrapper v1.5.0/bio/samtools/sort)
- StringTie assembly (*v 2.2.1)
- Matrix creation by prepDE.py3 (Python3 script)
- DESeq2 (R Markdown)
- Genes' names converter neg _pos (Python script)
- KEGG_WGCNA (R Markdown)
- Genes' names converter modules (Python script)
- KEGG modules (R Markdown)
- Genes' names converter hubs (Python script)
- KEGG hub (R Markdown)
- Reverse path (Python script)
- Final Report (R Markdown)
R markdown files will automatically install the following packages during the run of Snakemake pipeline:
DESeq2, dplyr, ggplot2, gplots,clusterProfiler, WGCNA, knitr, KEGGREST,stringr
The rest of the used software will be installed during the creation of snakemake environment. All installed tools can be found in file config.yaml.
Full installation description you can find on the webpage of Snakemake
2) Install Mamba
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-pypy3-MacOSX-x86_64.sh"
bash Mambaforge-pypy3-MacOSX-x86_64.sh
or
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-pypy3-MacOSX-x86_64.sh"
bash Mambaforge-pypy3-MacOSX-x86_64.sh
conda install -n base -c conda-forge mamba
3) Download repository and go to folder of snakemake
cd Path/to/folder/snakemake
conda activate base
mamba env create --name snakemake --file environment.yaml
- To delete environment
conda env remove -n snakemake
conda activate snakemake
- To deactivate snakemake environment
conda deactivate
snakemake --help
- Download needed RNA seq data sets to folder "snakemake/data/reads".
- Make sure that you have all needed files to run pipeline in folder "snakemake/data:
- Reference genome
- Reference annotation of
- pheno.csv # file with description of classes of experimental data
- File sorted_p_adj_genes.txt
- genefile.txt
snakemake --use-conda --cores all
LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to
create the same files in the following directory:
snakemake --unlock argument
--cores all # use all possible cores
--cores N # use N number of cores
snakemake --use-conda --cores all -F # force to rerun fully Snakemake pipeline, otherwise
# it will run only that parts that did not run before
CreateCondaEnvironmentException:
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please configure strict priorities by executing 'conda config --set channel_priority strict'.
*This is how the working pipeline starting its processes. In this part Snakemake describes which process will be performed and how many jobs per process (count).
Building DAG of jobs...
Using shell: /bin/bash
Job stats:
job count min threads max threads
----------------------------- ------- ------------- -------------
KEGG_WGCNA 1 1 1
KEGG_hub 1 1 1
KEGG_modules 1 1 1
all 1 1 1
deseq 1 1 1
fastqc 18 1 1
gene_name_converter_hubs 1 1 1
gene_name_converter_modules 1 1 1
gene_names_convereter_neg_pos 1 1 1
hisat2_align 18 2 2
hisat2_index 1 1 1
matrix_creation 1 1 1
reverse_path 1 1 1
samtools_sort 18 8 8
stringtie_assembly 18 8 8
total 83 1 8
Analysis of the AT-composition of the studied nodal genes; presented in a separate directory "AT_composition". The course of the analysis is written in the AT-composition notebook. Also in the directory is an additional file sequence.txt required for analysis (preprocessed file obtained from the annotation).
The result of Snakemake is Final_report.html file and opened in your browser maps of pathes of interest obtained via KEGG. In the results folder there is example of such file.
Genes for which log2FoldChange<1.5 and FDR<0.05 were considered insignificant; a total of 789 significant genes were identified. 370 positive genes and 419 negative genes were found:
A large number of differentially expressed genes does not give a clear idea of the nature of the changes under the action of an antibiotic:
For positive and negative differentially expressed genes, a KEGG enrichment analysis was performed. Positive DEGs:
- C5-Branched dibasic acid metabolism
- Valine, leucine and isoleucine biosynthesis
- Two-component system
- 2-Oxocarboxylic acid metabolism
- Ascorbate and aldarate metabolism
Negative DEGs:
- Flagellar assembly
- Oxidative phosphorylation
- Glycolysis / Gluconeogenesis
- Bacterial chemotaxis
- Starch and sucrose metabolism
Biological replicas are best grouped with each other, with the exception of sample 25. The processed samples at the time points of 10 and 20 minutes and the control samples at the same time points were grouped into two separate groups. The control and treated samples were separated into a separate group at 60 minutes. The data demonstrate an increase in the similarity of gene expression profiles of treated and untreated samples at 60 minutes.
The image shows hierarchical clustering of samples (two variables are highlighted: Treated: 0 - white, 1 - black; Time: 10 - pink, 20 - blue, 60 - green).
After building the network, 10 connected modules and 1 zero module were defined. Number of genes in modules: 154, 056, 937, 927, 462, 425, 294, 179, 176, 72, 47.
Module-clustering dendrogram:
**Modules-Traits relationships:**Modules associated with the “Treated” variable: brown, turquoise, black, yellow. Modules associated with the “Time” variable: green, blue, red.
Modules clustering:
*Modules black, brown, yellow with a strong positive correlation with the “Treated” variable were combined into one group.*For significant modules, KEGG enrichment analysis was carried out. "Treated"-associated modules
Negative module:
- Glycolysis / Gluconeogenesis
- Amino sugar and nucleotide sugar metabolism
- ABC transporters
- Porphyrin metabolism
- Histidine metabolism
- beta-Lactam resistance
Positive modules:
- Cationic antimicrobial peptide (CAMP) resistance
- Two-component system
- Lipopolysaccharide biosynthesis
- C5-Branched dibasic acid metabolism
- 2-Oxocarboxylic acid metabolism
- Two-component system
- Ascorbate and aldarate metabolism
- Bacterial secretion system
"Time"-associated modules
Positive module:
- Sulfur metabolism
- Lysine degradation
- Arginine and proline metabolism
- Microbial metabolism in diverse environments
Negative modules:
- Sulfur relay system
- Ribosome
- Aminoacyl-tRNA biosynthesis
The functionality of the R language was used to identify the hub genes in the module (hub genes were considered to have a correlation coefficient with the eigengene of more than 0.9). We obtained hub genes for modules associated with novobiocin and time. Number of hub genes for different modules: black - 42; brown - 167; yellow - 108; turquoise - 284; red - 94; blue - 147; green - 160. For hub genes from modules, KEGG enrichment analysis was carried out.
Pathways were found that were consistently found both among the hub genes of the modules and among the differentially expressed genes:
- C5-Branched dibasic acid metabolism;
- 2-Oxocarboxylic acid metabolism;
- Two-component system;
- Glycolysis / Gluconeogenesis.
These pathways are the most sensitive pathways to novobiocin treatment; they were used to visualize the dynamics of their expression.
The visualization of the dynamics of the hub genes pathways was carried out. As a generalization of gene expression, the first principal component was used for a subset of genes included in the module and included in a significant biological pathway.
The modules associated with "Treated" and "Time" were mapped to the chromosome using the WoPPER online tool: https://wopper.ba.itb.cnr.it/WoPPER#!/View/v1nnzudodwda9keleutqqrw8znqo
Time modules: | Treated modules: |
---|---|
The topological state of DNA influences its affinity for some DNA binding proteins, especially in DNA sequences that have a high A + T base content. We compared the AT composition of the hub gene sequences associated with novobiocin treatment with the AT composition of the hub genes associated with time.
The AT composition was calculated as the sum of A and T divided by the length of the sequence. This value was calculated for each gene from the list of hub genes. The results for the “Treated” and “Time” modules were combined. The distributions for the values were significantly different from normal (p-value = 2.2e-16 and p-value = 1.273e-12 for the "Treated" and "Time" modules, respectively), so the nonparametric Mann-Whitney test was used. The groups were significantly different from each other (p-value = 6.823e-11), although the difference between the means was small (0.496 and 0.470 for "Time" and "Treated", respectively).
The slight difference in the AT composition can be explained by the fact that it is important for supercoiling for the promoter regions of genes, so we studied it in the initial segments of the sequence using the sliding window method. A "window" of 30 nucleotides long (the approximate length of a promoter) was iteratively shifted one nucleotide from the beginning of the gene sequence; The AT composition was calculated within the limits of the window according to the above method. For each module, a sequence of values of the AT composition averaged for the hub genes of the module within the "window" is calculated.
The graph shows the frequencies of A and T, calculated within a 30-nucleotide "window". The X-axis shows a section from 1 to 300 nucleotides of the gene under study. The Y-axis plots the average frequency of A and T (AT-composition) for the hub genes of the module in a given position of the "window". Blue shows the frequencies for modules that are positively correlated with the "Treated" variable; the module shown in blue is negatively correlated with the "Treated" variable; modules in purple are not correlated with the "Treated" variable (associated with the "Time" variable).
The negatively correlated module showed frequencies close to the "Time" modules, so we focused on the positive "Treated" modules. For positive modules "Treated" and for modules "Time" the values were averaged. Green indicates A and T frequencies for positive "Treated" modules, blue indicates frequencies for "Time" modules. The X and Y values correspond to the previous plot.
Biological pathways may have different dynamics over time; Two-component system genes increase their expression by 20 minutes with a subsequent decrease; C5-Branched dibasic acid metabolism and 2-Oxocarboxylic acid metabolism genes increase expression over time; Glycolysis/Gluconeogenesis genes increase their expression over time in control samples, however, when treated with novobiocin, they do not show any noticeable dynamics. Two-component system is a system for perceiving changes in the environment; this system can also be associated with a change in supercoiling, which also perceives a large number of stress stimuli (pH, osmotic composition, etc.). The lack of dynamics in the expression of glycolysis pathway genes may be associated with the bacteriostatic effect of novobiocin (these processes are associated with anabolic pathways); C5-Branched dibasic acid metabolism and 2-Oxocarboxylic acid metabolism may be involved in some of the signaling pathways associated with changes in supercoiling.
The co-expression modules show a diffuse distribution along the length of the chromosome, which may correspond to the influence of a systemic process, such as a change in supercoiling. Genes in co-expressed modules are located at significant distances from each other (more than 1Mp). This can be explained by the topology of the chromosome, but does not exclude the influence of DNA supercoiling on the activation of modules. The initial regions of genes sensitive to novobiocin are characterized by a richer AT composition, which suggests sensitivity to changes in DNA supercoiling.
- Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]. F1000Research 2021, 10:33
- Kim, D., Paggi, J.M., Park, C. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019)
- Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
- Puccio, S., Grillo, G., Licciulli, F., Severgnini, M., Liuni, S., Bicciato, S., Peano, C. (2017). WoPPER: Web server for Position Related data analysis of gene Expression in Prokaryotes. Nucleic Acids Research, 45(W1), W109–W115.
- Zhang B and Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis, Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17 PMID: 16646834
- Twelve years of SAMtools and BCFtools Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
- Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M Transcriptome assembly from long-read RNA-seq alignments with StringTie2, Genome Biology 20, 278 (2019), doi:10.1186/s13059-019-1910-1
- Muskhelishvili G, Forquet R, Reverchon S, Meyer S, Nasser W. Coherent Domains of Transcription Coordinate Gene Expression During Bacterial Growth and Adaptation. Microorganisms. 2019 Dec 13;7(12):694. doi: 10.3390/microorganisms7120694. PMID: 31847191; PMCID: PMC6956064.
If you have any questions, suggestions or complains please approach authors via email: [email protected]