A Reproducible Snakemake-based Workflow for Project "Studying Salmonella Gene Expression Dynamics in Response to Novobiocin"

Project was performed as a part of studies in Bioinformatics institute on program Bioinformatics for Biologists during spring semester 2022.

README content

Introduction
Background
Aim
Project Objectives
Data Description
Snakemake Pipeline Description
Installation of Snakemake
Start running Snakemake pipeline
- Few errors solutions
Results
Useful References
Author and Acknowledgements
Feedback contacts

Introduction

In this project we analysed publicly available transcriptomic data obtained in time-series experiments on Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S treated with different concentrations of antibiotic novobiocin.

Background

DNA superspiralization is an important mechanism of gene regulation in bacteria and superspiralization alteration, for instance, in response to antibiotics can change gene expression levels. Antibiotics of aminocoumarin class, like novobiocin, are able to change both superspiralization of genomic DNA and expression of genes responding to superspiralization levels. The object of the study is Salmonella enterica, bacterium largely resistant to novobiocin and DNA superspiralization alterations in general. Here we aim to analyze gene expression changes in Salmonella cultures grown on media with varying concentrations of antibiotics and to identify clusters of coexpressing genes.

Aim

To investigate effects of novobiocin on gene expression in Salmonella enterica* subsp. enterica serovar Typhimurium str. 14028S.*

Project Objectives

1. align reads to reference genome;

2. obtain transcript count matrix;

3. DESeq2 analysis of the count matrix;

4. weighted gene co-expression network analysis using the WGCNA;

5. Wopper based visualization of modules of interest based on results of WGCNA analysis;

6. develop a reproducible workflow for transcriptomic data analysis based on Snakemake pipeline.

Data Description

Data is publicly available and can be found in the paper of Natalia Gogoleva et. al, 2020.

The provided dataset consists of RNA-seq reads obtained from samples of S. enterica cultures treated with novobiocin at concentrations of 100 and 500 μg per mL, and untreated cultures. These reads were already filtered using BBDuk (v. 37.23). In our project we performed analysis only on reads from untreated culture and with addition of 500 μg per mL of novobiocin at three different timepoints: 10, 20 and 60 min.

Snakemake Pipeline Desciption

Snakemake pipeline consists of 15 rules:

FastQC analysis of samples (SnakeMake Wrapper v 1.5.0/bio/fastqc)
Hisat2 Index (v 2.2.0)
Hisat2 Align (Snakemake wrapper v 1.5.0/bio/hisat2/align)
Samtools sort (Snakemake wrapper v1.5.0/bio/samtools/sort)
StringTie assembly (*v 2.2.1)
Matrix creation by prepDE.py3 (Python3 script)
DESeq2 (R Markdown)
Genes' names converter neg _pos (Python script)
KEGG_WGCNA (R Markdown)
Genes' names converter modules (Python script)
KEGG modules (R Markdown)
Genes' names converter hubs (Python script)
KEGG hub (R Markdown)
Reverse path (Python script)
Final Report (R Markdown)

R markdown files will automatically install the following packages during the run of Snakemake pipeline:

DESeq2, dplyr, ggplot2, gplots,clusterProfiler, WGCNA, knitr, KEGGREST,stringr

The rest of the used software will be installed during the creation of snakemake environment. All installed tools can be found in file config.yaml.

Installation of Snakemake

Full installation description you can find on the webpage of Snakemake

1) Install a Conda-based Python3 distribution

2) Install Mamba

Download miniforge

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-pypy3-MacOSX-x86_64.sh"
bash Mambaforge-pypy3-MacOSX-x86_64.sh

or

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-pypy3-MacOSX-x86_64.sh"
bash Mambaforge-pypy3-MacOSX-x86_64.sh

Install mamba

conda install -n base -c conda-forge mamba

3) Download repository and go to folder of snakemake

cd Path/to/folder/snakemake

4) Create snakemake environment

conda activate base
mamba env create --name snakemake --file environment.yaml

To delete environment

conda env remove -n snakemake

5) Activate snakemake environment

conda activate snakemake

To deactivate snakemake environment

conda deactivate

Snakemake help

snakemake --help

Start Snakemake pipeline

Data

RNA seq data

Download needed RNA seq data sets to folder "snakemake/data/reads".
Make sure that you have all needed files to run pipeline in folder "snakemake/data:
- Reference genome
- Reference annotation of
- pheno.csv # file with description of classes of experimental data
- File sorted_p_adj_genes.txt
- genefile.txt

snakemake --use-conda --cores all

Some mistakes solutions, suggestions and explanations

Possible mistakes and their solutions

LockException

LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to 
create the same files in the following directory:

LockException Solution

snakemake --unlock argument

Suggestions

--cores all # use all possible cores
--cores N   # use N number of cores

snakemake --use-conda --cores all -F # force to rerun fully Snakemake pipeline, otherwise 
                                     # it will run only that parts that did not run before

CreateCondaEnvironmentException

CreateCondaEnvironmentException:
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please configure strict priorities by executing 'conda config --set channel_priority strict'.

CreateCondaEnvironmentException solution

Explanations

*This is how the working pipeline starting its processes. In this part Snakemake describes which process will be performed and how many jobs per process (count).

Building DAG of jobs...
Using shell: /bin/bash
Job stats:
job                              count    min threads    max threads
-----------------------------  -------  -------------  -------------
KEGG_WGCNA                           1              1              1
KEGG_hub                             1              1              1
KEGG_modules                         1              1              1
all                                  1              1              1
deseq                                1              1              1
fastqc                              18              1              1
gene_name_converter_hubs             1              1              1
gene_name_converter_modules          1              1              1
gene_names_convereter_neg_pos        1              1              1
hisat2_align                        18              2              2
hisat2_index                         1              1              1
matrix_creation                      1              1              1
reverse_path                         1              1              1
samtools_sort                       18              8              8
stringtie_assembly                  18              8              8
total                               83              1              8

Results

Description of project results can be found in presentation in folder results

Additional analysis of the AT composition

Analysis of the AT-composition of the studied nodal genes; presented in a separate directory "AT_composition". The course of the analysis is written in the AT-composition notebook. Also in the directory is an additional file sequence.txt required for analysis (preprocessed file obtained from the annotation).

The result of Snakemake is Final_report.html file and opened in your browser maps of pathes of interest obtained via KEGG. In the results folder there is example of such file.

DESeq2

Genes for which log2FoldChange<1.5 and FDR<0.05 were considered insignificant; a total of 789 significant genes were identified. 370 positive genes and 419 negative genes were found:

A large number of differentially expressed genes does not give a clear idea of the nature of the changes under the action of an antibiotic:

For positive and negative differentially expressed genes, a KEGG enrichment analysis was performed. Positive DEGs:

C5-Branched dibasic acid metabolism
Valine, leucine and isoleucine biosynthesis
Two-component system
2-Oxocarboxylic acid metabolism
Ascorbate and aldarate metabolism

Negative DEGs:

Flagellar assembly
Oxidative phosphorylation
Glycolysis / Gluconeogenesis
Bacterial chemotaxis
Starch and sucrose metabolism

WGCNA

Samples clustering:

Biological replicas are best grouped with each other, with the exception of sample 25. The processed samples at the time points of 10 and 20 minutes and the control samples at the same time points were grouped into two separate groups. The control and treated samples were separated into a separate group at 60 minutes. The data demonstrate an increase in the similarity of gene expression profiles of treated and untreated samples at 60 minutes.

The image shows hierarchical clustering of samples (two variables are highlighted: Treated: 0 - white, 1 - black; Time: 10 - pink, 20 - blue, 60 - green).

Modules building:

After building the network, 10 connected modules and 1 zero module were defined. Number of genes in modules: 154, 056, 937, 927, 462, 425, 294, 179, 176, 72, 47.

Module-clustering dendrogram:

**Modules-Traits relationships:**

Modules associated with the “Treated” variable: brown, turquoise, black, yellow. Modules associated with the “Time” variable: green, blue, red.

Modules clustering:

*Modules black, brown, yellow with a strong positive correlation with the “Treated” variable were combined into one group.*

For significant modules, KEGG enrichment analysis was carried out. "Treated"-associated modules

Negative module:

Glycolysis / Gluconeogenesis
Amino sugar and nucleotide sugar metabolism
ABC transporters
Porphyrin metabolism
Histidine metabolism
beta-Lactam resistance

Positive modules:

Cationic antimicrobial peptide (CAMP) resistance
Two-component system
Lipopolysaccharide biosynthesis
C5-Branched dibasic acid metabolism
2-Oxocarboxylic acid metabolism
Two-component system
Ascorbate and aldarate metabolism
Bacterial secretion system

"Time"-associated modules

Positive module:

Sulfur metabolism
Lysine degradation
Arginine and proline metabolism
Microbial metabolism in diverse environments

Negative modules:

Sulfur relay system
Ribosome
Aminoacyl-tRNA biosynthesis

Hub genes analysis

The functionality of the R language was used to identify the hub genes in the module (hub genes were considered to have a correlation coefficient with the eigengene of more than 0.9). We obtained hub genes for modules associated with novobiocin and time. Number of hub genes for different modules: black - 42; brown - 167; yellow - 108; turquoise - 284; red - 94; blue - 147; green - 160. For hub genes from modules, KEGG enrichment analysis was carried out.

Pathways were found that were consistently found both among the hub genes of the modules and among the differentially expressed genes:

C5-Branched dibasic acid metabolism;
2-Oxocarboxylic acid metabolism;
Two-component system;
Glycolysis / Gluconeogenesis.

These pathways are the most sensitive pathways to novobiocin treatment; they were used to visualize the dynamics of their expression.

Analysis of the dynamics of significant pathways

The visualization of the dynamics of the hub genes pathways was carried out. As a generalization of gene expression, the first principal component was used for a subset of genes included in the module and included in a significant biological pathway.

| | |

Mapping modules per chromosome

The modules associated with "Treated" and "Time" were mapped to the chromosome using the WoPPER online tool: https://wopper.ba.itb.cnr.it/WoPPER#!/View/v1nnzudodwda9keleutqqrw8znqo

Time modules:	Treated modules:

AT-composition

The topological state of DNA influences its affinity for some DNA binding proteins, especially in DNA sequences that have a high A + T base content. We compared the AT composition of the hub gene sequences associated with novobiocin treatment with the AT composition of the hub genes associated with time.

The AT composition was calculated as the sum of A and T divided by the length of the sequence. This value was calculated for each gene from the list of hub genes. The results for the “Treated” and “Time” modules were combined. The distributions for the values were significantly different from normal (p-value = 2.2e-16 and p-value = 1.273e-12 for the "Treated" and "Time" modules, respectively), so the nonparametric Mann-Whitney test was used. The groups were significantly different from each other (p-value = 6.823e-11), although the difference between the means was small (0.496 and 0.470 for "Time" and "Treated", respectively).

The slight difference in the AT composition can be explained by the fact that it is important for supercoiling for the promoter regions of genes, so we studied it in the initial segments of the sequence using the sliding window method. A "window" of 30 nucleotides long (the approximate length of a promoter) was iteratively shifted one nucleotide from the beginning of the gene sequence; The AT composition was calculated within the limits of the window according to the above method. For each module, a sequence of values of the AT composition averaged for the hub genes of the module within the "window" is calculated.

The graph shows the frequencies of A and T, calculated within a 30-nucleotide "window". The X-axis shows a section from 1 to 300 nucleotides of the gene under study. The Y-axis plots the average frequency of A and T (AT-composition) for the hub genes of the module in a given position of the "window". Blue shows the frequencies for modules that are positively correlated with the "Treated" variable; the module shown in blue is negatively correlated with the "Treated" variable; modules in purple are not correlated with the "Treated" variable (associated with the "Time" variable).

The negatively correlated module showed frequencies close to the "Time" modules, so we focused on the positive "Treated" modules. For positive modules "Treated" and for modules "Time" the values were averaged. Green indicates A and T frequencies for positive "Treated" modules, blue indicates frequencies for "Time" modules. The X and Y values correspond to the previous plot.

Discussion

Biological pathways may have different dynamics over time; Two-component system genes increase their expression by 20 minutes with a subsequent decrease; C5-Branched dibasic acid metabolism and 2-Oxocarboxylic acid metabolism genes increase expression over time; Glycolysis/Gluconeogenesis genes increase their expression over time in control samples, however, when treated with novobiocin, they do not show any noticeable dynamics. Two-component system is a system for perceiving changes in the environment; this system can also be associated with a change in supercoiling, which also perceives a large number of stress stimuli (pH, osmotic composition, etc.). The lack of dynamics in the expression of glycolysis pathway genes may be associated with the bacteriostatic effect of novobiocin (these processes are associated with anabolic pathways); C5-Branched dibasic acid metabolism and 2-Oxocarboxylic acid metabolism may be involved in some of the signaling pathways associated with changes in supercoiling.

The co-expression modules show a diffuse distribution along the length of the chromosome, which may correspond to the influence of a systemic process, such as a change in supercoiling. Genes in co-expressed modules are located at significant distances from each other (more than 1Mp). This can be explained by the topology of the chromosome, but does not exclude the influence of DNA supercoiling on the activation of modules. The initial regions of genes sensitive to novobiocin are characterized by a richer AT composition, which suggests sensitivity to changes in DNA supercoiling.

Useful References

Authors and Acknowledgements

Students

Supervisor

Alexandr Tkachenko

Feedback

If you have any questions, suggestions or complains please approach authors via email: [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 104 Commits
AT_composition		AT_composition
Results		Results
extra		extra
snakemake		snakemake
.gitignore		.gitignore
README.md		README.md
requirement.txt		requirement.txt

ValeriiaLadyhina/BI_Project_analysis_of_effect_of-novobiocin_on_Salmonella

Folders and files

Latest commit

History

Repository files navigation