Merge branch 'dev' into feature/remove-commented-code
hdbeukel authored May 13, 2024
2 parents 768eddb + ed3f43d commit 6503a60
Showing 36 changed files with 101,216 additions and 285 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,48 @@
+name: MINI-AC test suite
+
+on:
+  push:
+    branches: [ "main", "dev" ]
+  pull_request:
+    branches: [ "main", "dev" ]
+
+jobs:
+  nf-test:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Prepare nf-test config file
+      run: sed -i -e "s@%TMP%@${RUNNER_TEMP}@g" tests/nextflow.config
+
+    - uses: actions/setup-java@v3
+      with:
+        distribution: oracle
+        java-version: 17
+
+    - name: Check Java version
+      run: java -version
+
+    - name: Setup Nextflow
+      uses: nf-core/[email protected]
+
+    - name: Setup singularity
+      uses: eWaterCycle/setup-singularity@v7
+      with:
+        singularity-version: 3.8.3
+
+    - name: Setup nf-test
+      run: wget -qO- https://code.askimed.com/install/nf-test | bash
+
+    - name: Fetch motif mapping files
+      run: |
+        curl -k -o tests/data/zma_v4_chr1/zma_v4_genome_wide_motif_mappings_chr1.bed https://floppy.psb.ugent.be/index.php/s/NekMYztyxEnsQiY/download/zma_v4_genome_wide_motif_mappings_chr1.bed
+        curl -k -o tests/data/zma_v4_chr1/zma_v4_locus_based_motif_mappings_5kbup_1kbdown_chr1.bed https://floppy.psb.ugent.be/index.php/s/r2wQmFjPy79qSp7/download/zma_v4_locus_based_motif_mappings_5kbup_1kbdown_chr1.bed
+        curl -k -o data/ath/ath_genome_wide_motif_mappings.bed https://floppy.psb.ugent.be/index.php/s/iaZPwdrRGe3YDdK/download/ath_genome_wide_motif_mappings.bed
+        curl -k -o data/ath/ath_locus_based_motif_mappings_5kbup_1kbdown.bed https://floppy.psb.ugent.be/index.php/s/qcQ7KndzHaSpd9e/download/ath_locus_based_motif_mappings_5kbup_1kbdown.bed
+
+    - name: Run nf-test
+      shell: bash
+      run: ./nf-test test
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@
+# ignore Nextflow cache and logs
+.nextflow/
+.nextflow.log*
+
+# ignore Singularity cache
+singularity_cache/
+
+# ignore large motif mapping files
+*motif_mappings*.bed
+
+# ignore nf-test executable
+nf-test
+
+# ignore test cache
+.nf-test/
+
+# ignore test outputs
+tests/outputs/
+
+# ignore SLURM output and error files
+slurm.*.out
+slurm.*.err
+
+# ignore jupyter notebook checkpoints
+.ipynb_checkpoints/
+
+# python cache and compiled files
+__pycache__/
+*.pyc
28 changes: 19 additions & 9 deletions README.md
@@ -11,18 +11,18 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr
 4. Generation of a functional GRN by gene ontology (GO) enrichment of the regulons.
 5. Integration of data to generate informative, user-friendly output files.

-Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and maize. Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
+Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
 * **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is advised when working with species with long intergenic regions and distal regulatory elements, like maize for example.
-* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a denser signal of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/configuration_pipeline.md).
+* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/configuration_pipeline.md).


-A detailed overview of the necessary input files and expected output files can be found [here](example).
+A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.


 ## **Inputs**
 * **MINI-AC mode**: genome-wide or locus-based.
-* **Species**: Arabidopsis or maize.
-* **ACR files**: BED files containing genomic coordinates corresponding to accessible chromatin regions (minimal format of 3 columns: chromosome, start, stop). The ACR files' coordinates **must** correspond to the genome versions of Araport11 for Arabidopsis and AGPv4 for maize.
+* **Species**: Arabidopsis or maize (maize genome version 4 or 5).
+* **ACR files**: BED files containing genomic coordinates corresponding to accessible chromatin regions (minimal format of 3 columns: chromosome, start, stop). The ACR files' coordinates **must** correspond to the genome versions of Araport11 for Arabidopsis and B73 RefGen_v4 and B73 RefGen_v5 for maize.
 * **Output folder**: Path where the results will be stored.
 * **(Optional) DEGs file**: Tab-separated txt file with differential expression data associated with the input ACRs. The only format requirements are that the first row has to be the header (column names), and the first column has to contain gene IDs. There is no requirement for the number of columns or content, although it should contain statistics associated with a DE analysis.
 * **(Optional) Expressed genes file**: One-column txt file with gene IDs for genes expressed in the biological context of the input ACRs, to filter the inferred GRNs.
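
The minimal ACR format listed in the inputs (three columns: chromosome, start, stop) can be checked before a run is launched. The following is a minimal sketch and not part of MINI-AC: the function name `check_acr_bed` is hypothetical, and the assumptions that columns are tab-separated and that `start < stop` come from the usual BED convention rather than from the README.

```python
from typing import Iterable, List, Tuple


def check_acr_bed(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Parse BED-like lines and return (chromosome, start, stop) tuples.

    Assumptions (not from the MINI-AC docs): tab-separated columns,
    integer coordinates, and start < stop as in the usual BED convention.
    """
    regions = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {number}: expected at least 3 columns, got {len(fields)}"
            )
        # int() raises ValueError on non-numeric coordinates.
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or start >= stop:
            raise ValueError(f"line {number}: invalid interval {start}-{stop}")
        regions.append((chrom, start, stop))
    return regions
```

Extra columns beyond the first three are accepted and ignored, which matches the "minimal format" wording; the check does not verify that coordinates belong to the required genome version.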
@@ -41,12 +41,22 @@ The pipeline will run in parallel for multiple ACR BED input files. The two opti
 * [Singularity](https://sylabs.io/guides/3.0/user-guide/index.html)
 * [wget](https://www.gnu.org/software/wget/)
 * Motif mapping files. They need to be downloaded by executing the following commands on the **top-level directory of the repository**:

+For Arabidopsis
+
 ```
 wget https://zenodo.org/record/7974527/files/ath_genome_wide_motif_mappings.bed?download=1 -O data/ath/ath_genome_wide_motif_mappings.bed
 wget https://zenodo.org/record/7974527/files/ath_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/ath/ath_locus_based_motif_mappings_5kbup_1kbdown.bed
-wget https://zenodo.org/record/7974527/files/zma_genome_wide_motif_mappings.bed?download=1 -O data/zma/zma_genome_wide_motif_mappings.bed
-wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma/zma_locus_based_motif_mappings_5kbup_1kbdown.bed
 ```
+For maize RefGen_v4
+```
+wget https://zenodo.org/record/7974527/files/zma_genome_wide_motif_mappings.bed?download=1 -O data/zma_v4/zma_v4_genome_wide_motif_mappings.bed
+wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma_v4/zma_v4_locus_based_motif_mappings_5kbup_1kbdown.bed
+```
+For maize RefGen_v5
+```
+wget https://zenodo.org/record/8386283/files/zma_v5_genome_wide_motif_mappings.bed?download=1 -O data/zma_v5/zma_v5_genome_wide_motif_mappings.bed
+wget https://zenodo.org/record/8386283/files/zma_v5_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma_v5/zma_v5_locus_based_motif_mappings_5kbup_1kbdown.bed
+```

 NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10.6, Singularity version 3.8.7-1.el7 and in a Sun Grid Engine (SGE) computer cluster.
@@ -57,7 +67,7 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10
 Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:

 ```shell
-nextflow -C mini_ac.config run mini_ac.nf --mode genome_wide --species maize
+nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
 ```

 Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).
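
The updated usage line accepts two modes and three species values. A minimal sketch of assembling and validating that command from Python is given here; the helper name `build_mini_ac_command` is hypothetical, and only the flags and values shown in the README command are used.

```python
from typing import List

# Allowed values, taken from the README usage line.
MODES = ("genome_wide", "locus_based")
SPECIES = ("arabidopsis", "maize_v4", "maize_v5")


def build_mini_ac_command(
    mode: str, species: str, config: str = "mini_ac.config"
) -> List[str]:
    """Return the Nextflow argument list for one MINI-AC run."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    if species not in SPECIES:
        raise ValueError(f"unknown species {species!r}; expected one of {SPECIES}")
    return [
        "nextflow", "-C", config, "run", "mini_ac.nf",
        "--mode", mode, "--species", species,
    ]


print(" ".join(build_mini_ac_command("genome_wide", "maize_v4")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the top-level directory of the repository. Rejecting the old value `maize` early makes the switch to `maize_v4` and `maize_v5` visible before Nextflow starts.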
@@ -71,7 +81,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an

 When publishing results generated using MINI-AC, please cite:

-Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” bioRxiv, May 26, 2023. https://doi.org/10.1101/2023.05.26.542269.
+Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.

 ## Contact

3 changes: 2 additions & 1 deletion bin/add_go_names.py
@@ -4,7 +4,8 @@
 import go_manipulations

 gene_go_file = argv[1]
+ontology_file = argv[2]

-go_tree = go_manipulations.GOtree(path.join(path.dirname(path.dirname(argv[0])), "ontologies", "go.obo"))
+go_tree = go_manipulations.GOtree(ontology_file)

 go_tree.add_descriptions(gene_go_file)
16 changes: 11 additions & 5 deletions bin/getGO_xlsx_gw.py
@@ -34,7 +34,7 @@ def parseArgs():
     parser.add_argument('-ex', '--expressed_genes_file', nargs = 1, type = str,
                         default = None, help = '',
                         metavar = 'List of genes expressed in biological context of experiment')

     args = parser.parse_args()

     return args
@@ -94,8 +94,11 @@ def parseArgs():

 if not GO_info:
     empty_table = pd.DataFrame(["### This dataset did not yield any GO enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()

 ### Integrating data ###
@@ -130,5 +133,8 @@ def parseArgs():

 ### Writing output file ###

-with pd.ExcelWriter(output_file) as writer:
-    go_df.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    go_df.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        go_df.to_excel(writer, index = False)
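
The same extension check recurs in each output script touched by this commit. A minimal sketch of how it could be factored into one helper follows; `write_table` is a hypothetical name, not a function in the repository, and passing the path directly to `to_excel` instead of opening a `pd.ExcelWriter` is a simplification that should be equivalent for a single sheet.

```python
def write_table(table, output_file: str, header: bool = True) -> str:
    """Write `table` as CSV if the path ends in '.csv', else as Excel.

    `table` is expected to be a pandas DataFrame; any object exposing
    `to_csv` and `to_excel` with these keyword arguments also works.
    Returns the name of the format that was written.
    """
    if output_file.endswith(".csv"):
        table.to_csv(output_file, index=False, header=header)
        return "csv"
    # Every other extension falls through to Excel, as in the scripts.
    table.to_excel(output_file, index=False, header=header)
    return "excel"
```

Under these assumptions, `write_table(go_df, output_file)` would stand in for the final `if`/`else`, and `write_table(empty_table, output_file, header=False)` for the empty-result branch.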
16 changes: 11 additions & 5 deletions bin/getGO_xlsx_lb.py
@@ -34,7 +34,7 @@ def parseArgs():
     parser.add_argument('-ex', '--expressed_genes_file', nargs = 1, type = str,
                         default = None, help = '',
                         metavar = 'List of genes expressed in biological context of experiment')

     args = parser.parse_args()

     return args
@@ -94,8 +94,11 @@ def parseArgs():

 if not GO_info:
     empty_table = pd.DataFrame(["### This dataset did not yield any GO enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()

 ### Integrating data ###
@@ -130,5 +133,8 @@ def parseArgs():

 ### Writing output file ###

-with pd.ExcelWriter(output_file) as writer:
-    go_df.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    go_df.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        go_df.to_excel(writer, index = False)
18 changes: 12 additions & 6 deletions bin/getMotifCentricOutput_gw.py
@@ -104,8 +104,11 @@ def parseArgs():

 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()

 for col in enr_stats.select_dtypes(include = ['float']).columns:
@@ -122,11 +125,11 @@ def parseArgs():
 if expressed_genes_file:
     enr_stats['Any expressed gene'] = enr_stats.gene_id.isin(exp_genes)

-    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)

 if not expressed_genes_file:

-    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)

 enr_stats = enr_stats.merge(mot_tf, how = 'right', left_on = 'motif', right_on = 'motif_id').drop('motif_id', axis = 1)
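
The switch from `list(set(x))` to `sorted(set(x))` presumably makes the joined `family` string reproducible, which matters once outputs are compared in a test suite: the iteration order of a set of strings can change between interpreter runs because of hash randomization, whereas sorting fixes the order. A small illustration with made-up family labels (the helper name `join_unique` is hypothetical):

```python
def join_unique(values):
    """Deduplicate values and join them in a stable, sorted order."""
    return ",".join(sorted(set(values)))


# Hypothetical TF family labels, given in two different input orders.
first = join_unique(["bZIP", "MYB", "bZIP", "WRKY"])
second = join_unique(["WRKY", "bZIP", "MYB", "MYB"])

print(first)            # MYB,WRKY,bZIP
print(first == second)  # True
```

Sorting is by code point, so uppercase names come before lowercase ones; that is enough for reproducibility even if it is not alphabetical in the everyday sense.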

@@ -146,5 +149,8 @@ def parseArgs():

 ### Writing output file ###

-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
18 changes: 12 additions & 6 deletions bin/getMotifCentricOutput_lb.py
@@ -104,8 +104,11 @@ def parseArgs():

 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()

 for col in enr_stats.select_dtypes(include = ['float']).columns:
@@ -122,11 +125,11 @@ def parseArgs():
 if expressed_genes_file:
     enr_stats['Any expressed gene'] = enr_stats.gene_id.isin(exp_genes)

-    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif','real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif','real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)

 if not expressed_genes_file:

-    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)

 enr_stats = enr_stats.merge(mot_tf, how = 'right', left_on = 'motif', right_on = 'motif_id').drop('motif_id', axis = 1)

@@ -146,5 +149,8 @@ def parseArgs():

 ### Writing output file ###

-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
15 changes: 11 additions & 4 deletions bin/getTFCentricOutput_gw.py
@@ -133,8 +133,11 @@ def parseArgs():

 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()

 ### Reading and processing GO enrichment data ###
@@ -261,5 +264,9 @@ def parseArgs():

 ### Writing output file ###

-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
+