Merge branch 'dev' into feature/remove-commented-code

VIB-PSB · May 13, 2024 · 6503a60
2 parents 768eddb + ed3f43d

Showing 36 changed files with 101,216 additions and 285 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,48 @@
+name: MINI-AC test suite
+
+on:
+  push:
+    branches: [ "main", "dev" ]
+  pull_request:
+    branches: [ "main", "dev" ]
+
+jobs:
+  nf-test:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Prepare nf-test config file
+      run: sed -i -e "s@%TMP%@${RUNNER_TEMP}@g" tests/nextflow.config
+
+    - uses: actions/setup-java@v3
+      with:
+        distribution: oracle
+        java-version: 17
+
+    - name: Check Java version
+      run: java -version
+
+    - name: Setup Nextflow
+      uses: nf-core/[email protected]
+
+    - name: Setup singularity
+      uses: eWaterCycle/setup-singularity@v7
+      with:
+        singularity-version: 3.8.3
+
+    - name: Setup nf-test
+      run: wget -qO- https://code.askimed.com/install/nf-test | bash
+
+    - name: Fetch motif mapping files
+      run: |
+        curl -k -o tests/data/zma_v4_chr1/zma_v4_genome_wide_motif_mappings_chr1.bed https://floppy.psb.ugent.be/index.php/s/NekMYztyxEnsQiY/download/zma_v4_genome_wide_motif_mappings_chr1.bed
+        curl -k -o tests/data/zma_v4_chr1/zma_v4_locus_based_motif_mappings_5kbup_1kbdown_chr1.bed https://floppy.psb.ugent.be/index.php/s/r2wQmFjPy79qSp7/download/zma_v4_locus_based_motif_mappings_5kbup_1kbdown_chr1.bed
+        curl -k -o data/ath/ath_genome_wide_motif_mappings.bed https://floppy.psb.ugent.be/index.php/s/iaZPwdrRGe3YDdK/download/ath_genome_wide_motif_mappings.bed
+        curl -k -o data/ath/ath_locus_based_motif_mappings_5kbup_1kbdown.bed https://floppy.psb.ugent.be/index.php/s/qcQ7KndzHaSpd9e/download/ath_locus_based_motif_mappings_5kbup_1kbdown.bed
+
+    - name: Run nf-test
+      shell: bash
+      run: ./nf-test test
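
The "Prepare nf-test config file" step above rewrites a `%TMP%` placeholder in `tests/nextflow.config` with the runner's temporary directory. A minimal Python sketch of the same substitution follows. The function name and the sample config line are hypothetical, since the contents of `tests/nextflow.config` are not shown in this diff.

```python
def fill_tmp_placeholder(config_text, tmp_dir):
    """Replace every %TMP% placeholder with tmp_dir, like the sed step does."""
    return config_text.replace("%TMP%", tmp_dir)


# Hypothetical config line; the real file may use the placeholder differently.
sample = 'workDir = "%TMP%/work"'
print(fill_tmp_placeholder(sample, "/tmp/runner"))
```

Unlike the `sed` expression, a plain string replacement does not treat `@` or `&` in the directory path specially.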
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,29 @@
+# ignore Nextflow cache and logs
+.nextflow/
+.nextflow.log*
+
+# ignore Singularity cache
+singularity_cache/
+
+# ignore large motif mapping files
+*motif_mappings*.bed
+
+# ignore nf-test executable
+nf-test
+
+# ignore test cache
+.nf-test/
+
+# ignore test outputs
+tests/outputs/
+
+# ignore SLURM output and error files
+slurm.*.out
+slurm.*.err
+
+# ignore jupyter notebook checkpoints
+.ipynb_checkpoints/
+
+# python cache and compiled files
+__pycache__/
+*.pyc
diff --git a/README.md b/README.md
@@ -11,18 +11,18 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr
 4. Generation of a functional GRN by gene ontology (GO) enrichment of the regulons.
 5. Integration of data to generate informative, user-friendly output files.
 
-Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and maize. Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
+Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
 * **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is advised when working with species with long intergenic regions and distal regulatory elements, like maize for example.
-* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a denser signal of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/configuration_pipeline.md).
+* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/configuration_pipeline.md).
 
 
-A detailed overview of the necessary input files and expected output files can be found [here](example).
+A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.
 
 
 ## **Inputs**
 * **MINI-AC mode**: genome-wide or locus-based.
-* **Species**: Arabidopsis or maize.
-* **ACR files**: BED files containing genomic coordinates corresponding to accessible chromatin regions (minimal format of 3 columns: chromosome, start, stop). The ACR files' coordinates **must** correspond to the genome versions of Araport11 for Arabidopsis and AGPv4 for maize.
+* **Species**: Arabidopsis or maize (maize genome version 4 or 5).
+* **ACR files**: BED files containing genomic coordinates corresponding to accessible chromatin regions (minimal format of 3 columns: chromosome, start, stop). The ACR files' coordinates **must** correspond to the genome versions of Araport11 for Arabidopsis and B73 RefGen_v4 and B73 RefGen_v5 for maize.
 * **Output folder**: Path where the results will be stored.
 * **(Optional) DEGs file**: Tab-separated txt file with differential expression data associated with the input ACRs. The only format requirements are that the first row has to be the header (column names), and the first column has to contain gene IDs. There is no requirement for the number of columns or content, although it should contain statistics associated to a DE analysis.
 * **(Optional) Expressed genes file**: One-column txt file with gene IDs for genes expressed in the biological context of the input ACRs, to filter the inferred GRNs.
@@ -41,12 +41,22 @@ The pipeline will run in parallel for multiple ACR BED input files. The two opti
 * [Singularity](https://sylabs.io/guides/3.0/user-guide/index.html)
 * [wget](https://www.gnu.org/software/wget/)
 * Motif mapping files. They need to be downloaded by executing the following commands on the **top-level directory of the repository**:
+
+  For Arabidopsis
 
   ```
   wget https://zenodo.org/record/7974527/files/ath_genome_wide_motif_mappings.bed?download=1 -O data/ath/ath_genome_wide_motif_mappings.bed
   wget https://zenodo.org/record/7974527/files/ath_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/ath/ath_locus_based_motif_mappings_5kbup_1kbdown.bed
-  wget https://zenodo.org/record/7974527/files/zma_genome_wide_motif_mappings.bed?download=1 -O data/zma/zma_genome_wide_motif_mappings.bed
-  wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma/zma_locus_based_motif_mappings_5kbup_1kbdown.bed
+  ```
+  For maize RefGen_v4
+  ```
+  wget https://zenodo.org/record/7974527/files/zma_genome_wide_motif_mappings.bed?download=1 -O data/zma_v4/zma_v4_genome_wide_motif_mappings.bed
+  wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma_v4/zma_v4_locus_based_motif_mappings_5kbup_1kbdown.bed
+  ```
+  For maize RefGen_v5
+  ```
+  wget https://zenodo.org/record/8386283/files/zma_v5_genome_wide_motif_mappings.bed?download=1 -O data/zma_v5/zma_v5_genome_wide_motif_mappings.bed
+  wget https://zenodo.org/record/8386283/files/zma_v5_locus_based_motif_mappings_5kbup_1kbdown.bed?download=1 -O data/zma_v5/zma_v5_locus_based_motif_mappings_5kbup_1kbdown.bed
   ```
 
 NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10.6, Singularity version 3.8.7-1.el7 and in a Sun Grid Engine (SGE) computer cluster.
@@ -57,7 +67,7 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10
 Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:
 
 ```shell
-nextflow -C mini_ac.config run mini_ac.nf --mode genome_wide --species maize
+nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
 ```
 
 Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).
@@ -71,7 +81,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an
 
 When publishing results generated using MINI-AC, please cite:
 
-Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” bioRxiv, May 26, 2023. https://doi.org/10.1101/2023.05.26.542269.
+Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.
 
 ## Contact
 

diff --git a/bin/add_go_names.py b/bin/add_go_names.py
@@ -4,7 +4,8 @@
 import go_manipulations
 
 gene_go_file = argv[1]
+ontology_file = argv[2]
 
-go_tree = go_manipulations.GOtree(path.join(path.dirname(path.dirname(argv[0])), "ontologies", "go.obo"))
+go_tree = go_manipulations.GOtree(ontology_file)
 
 go_tree.add_descriptions(gene_go_file)
diff --git a/bin/getGO_xlsx_gw.py b/bin/getGO_xlsx_gw.py
@@ -34,7 +34,7 @@ def parseArgs():
     parser.add_argument('-ex', '--expressed_genes_file', nargs = 1, type = str,
                         default = None, help = '',
                         metavar = 'List of genes expressed in biological context of experiment')
-
+    
     args = parser.parse_args()
 
     return args
@@ -94,8 +94,11 @@ def parseArgs():
 
 if not GO_info:
     empty_table = pd.DataFrame(["### This dataset did not yield any GO enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()
 
 ### Integrating data ###
@@ -130,5 +133,8 @@ def parseArgs():
 
 ### Writing output file ###
 
-with pd.ExcelWriter(output_file) as writer:
-    go_df.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    go_df.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        go_df.to_excel(writer, index = False)
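
The same CSV-or-Excel branch is repeated in each output script touched by this commit. One possible way to factor it out is sketched below. The helper name `write_table` is hypothetical and not part of the repository. The sketch passes the path straight to `DataFrame.to_excel` instead of opening a `pd.ExcelWriter`, which should be equivalent for a single sheet.

```python
def write_table(df, output_file, header=True):
    """Write df as CSV if output_file ends in '.csv', otherwise as an Excel file.

    df is expected to be a pandas DataFrame; only its to_csv and to_excel
    methods are used, so no pandas import is needed here.
    """
    if str(output_file).endswith(".csv"):
        df.to_csv(output_file, index=False, header=header)
    else:
        df.to_excel(output_file, index=False, header=header)
```

With such a helper, the empty-result branch would become `write_table(empty_table, output_file, header=False)` and the final write `write_table(go_df, output_file)`.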
diff --git a/bin/getGO_xlsx_lb.py b/bin/getGO_xlsx_lb.py
@@ -34,7 +34,7 @@ def parseArgs():
     parser.add_argument('-ex', '--expressed_genes_file', nargs = 1, type = str,
                         default = None, help = '',
                         metavar = 'List of genes expressed in biological context of experiment')
-
+    
     args = parser.parse_args()
 
     return args
@@ -94,8 +94,11 @@ def parseArgs():
 
 if not GO_info:
     empty_table = pd.DataFrame(["### This dataset did not yield any GO enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()
 
 ### Integrating data ###
@@ -130,5 +133,8 @@ def parseArgs():
 
 ### Writing output file ###
 
-with pd.ExcelWriter(output_file) as writer:
-    go_df.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    go_df.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        go_df.to_excel(writer, index = False)
diff --git a/bin/getMotifCentricOutput_gw.py b/bin/getMotifCentricOutput_gw.py
@@ -104,8 +104,11 @@ def parseArgs():
 
 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()
 
 for col in enr_stats.select_dtypes(include = ['float']).columns:
@@ -122,11 +125,11 @@ def parseArgs():
 if expressed_genes_file:
     enr_stats['Any expressed gene'] = enr_stats.gene_id.isin(exp_genes)
 
-    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
 
 if not expressed_genes_file:
 
-    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
 
 enr_stats = enr_stats.merge(mot_tf, how = 'right', left_on = 'motif', right_on = 'motif_id').drop('motif_id', axis = 1)
 
@@ -146,5 +149,8 @@ def parseArgs():
 
 ### Writing output file ###
 
-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
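
The switch from `list(set(x))` to `sorted(set(x))` in the `family` aggregation above makes the joined string independent of set iteration order. For strings, that order can differ between interpreter runs because of hash randomization, which would make outputs hard to compare in a test suite. That is presumably the motivation, though the commit does not state it. A small illustration with made-up family names:

```python
def join_unique(values):
    """Join the distinct values with commas in a stable, sorted order."""
    return ",".join(sorted(set(values)))


# Made-up TF family labels, including a duplicate.
families = ["bZIP", "MYB", "bZIP", "WRKY"]
print(join_unique(families))  # MYB,WRKY,bZIP
```

Sorting is by code point, so uppercase names come before lowercase ones.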
diff --git a/bin/getMotifCentricOutput_lb.py b/bin/getMotifCentricOutput_lb.py
@@ -104,8 +104,11 @@ def parseArgs():
 
 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()
 
 for col in enr_stats.select_dtypes(include = ['float']).columns:
@@ -122,11 +125,11 @@ def parseArgs():
 if expressed_genes_file:
     enr_stats['Any expressed gene'] = enr_stats.gene_id.isin(exp_genes)
 
-    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif','real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif','real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x))), 'Any expressed gene': any}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
 
 if not expressed_genes_file:
 
-    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(list(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
+    enr_stats = enr_stats.groupby(['dataset', 'input_total_peaks', 'peaks_in_promoter', 'motif', 'real_int', 'shuffled_int', 'p_val', 'enr_fold', 'adj_pval', 'pi_value', 'rank_pi_val']).agg({'gene_id': ','.join, 'family': lambda x: ','.join(sorted(set(x)))}).reset_index().sort_values(by = 'rank_pi_val').drop('gene_id', axis = 1)
 
 enr_stats = enr_stats.merge(mot_tf, how = 'right', left_on = 'motif', right_on = 'motif_id').drop('motif_id', axis = 1)
 
@@ -146,5 +149,8 @@ def parseArgs():
 
 ### Writing output file ###
 
-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
diff --git a/bin/getTFCentricOutput_gw.py b/bin/getTFCentricOutput_gw.py
@@ -133,8 +133,11 @@ def parseArgs():
 
 if enr_stats.empty:
     empty_table = pd.DataFrame(["### This dataset did not yield any motif enrichment"])
-    with pd.ExcelWriter(output_file) as writer:
-        empty_table.to_excel(writer, index = False, header = False)
+    if(output_file.endswith('.csv')):
+        empty_table.to_csv(output_file, index = False, header = False)
+    else:
+        with pd.ExcelWriter(output_file) as writer:
+            empty_table.to_excel(writer, index = False, header = False)
     sys.exit()
 
 ### Reading and processing GO enrichment data ###
@@ -261,5 +264,9 @@ def parseArgs():
 
 ### Writing output file ###
 
-with pd.ExcelWriter(output_file) as writer:
-    enr_stats.to_excel(writer, index = False)
+if (output_file.endswith('.csv')):
+    enr_stats.to_csv(output_file, index = False)
+else:
+    with pd.ExcelWriter(output_file) as writer:
+        enr_stats.to_excel(writer, index = False)
+