Merge pull request #23 from VIB-PSB/dev

iCREs-based GRN inference feature
VIB-PSB · May 24, 2024 · 04f1a87
2 parents 5193b83 + 3736ff6
commit 04f1a87

Showing 17 changed files with 2,618 additions and 31 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,13 @@ singularity_cache/
 # ignore large motif mapping files
 *motif_mappings*.bed
 
+# ignore icres files (too large for repo)
+*icres*.bed
+
+# ignore iCREs output files (confidentiality until publication)
+example/outputs_icres/*
+!example/outputs_icres/.gitkeep
+
 # ignore nf-test executable
 nf-test
 
@@ -18,8 +25,8 @@ nf-test
 tests/outputs/
 
 # ignore SLURM output and error files
-slurm.*.out
-slurm.*.err
+slurm*.out
+slurm*.err
 
 # ignore jupyter notebook checkpoints
 .ipynb_checkpoints/
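
The SLURM pattern change in this hunk widens what gets ignored. A minimal sketch of the difference, assuming Python's `fnmatch` is a close enough stand-in for gitignore globbing on slash-free file names (gitignore has its own rules, so this is an approximation); the file names are hypothetical and chosen only for illustration:

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "slurm.*.out"  # pattern removed by this commit
NEW_PATTERN = "slurm*.out"   # pattern added by this commit


def matched_by(pattern, names):
    """Return the names that the glob pattern matches (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical file names, not taken from the repository.
names = ["slurm.12345.out", "slurm-12345.out", "slurm_job.out", "run.out"]

print(matched_by(OLD_PATTERN, names))  # ['slurm.12345.out']
print(matched_by(NEW_PATTERN, names))  # ['slurm.12345.out', 'slurm-12345.out', 'slurm_job.out']
```

The old pattern requires a literal dot right after `slurm`, so names such as `slurm-12345.out` were not ignored; the new pattern covers any name that starts with `slurm` and ends with `.out`.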

diff --git a/README.md b/README.md
@@ -13,12 +13,10 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr
 
 Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run in two different modes depending on the non-coding genomic space considered for motif mapping:
 * **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is advised when working with species with long intergenic regions and distal regulatory elements, such as maize.
-* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/configuration_pipeline.md).
-
+* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/pipeline_configuration.md).
 
 A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.
 
-
 ## **Inputs**
 * **MINI-AC mode**: genome-wide or locus-based.
 * **Species**: Arabidopsis or maize (maize genome version 4 or 5).
@@ -63,15 +61,31 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10
 
 ## Usage
 
-
-Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:
+Define the paths to the input files and the desired parameter settings in the [configuration file](docs/pipeline_configuration.md), then run the pipeline by executing the following Nextflow command:
 
 ```shell
 nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
 ```
 
 Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).
 
+## iCREs-based MINI-AC [NOT AVAILABLE UNTIL PUBLICATION]
+
+Given the amount of resources available to profile regulatory DNA in maize, we curated a collection of integrated cis-regulatory elements (iCREs) by combining and comparing different CRE-profiling methods (details to be published).
+
+We implemented a new framework in which it is possible to run MINI-AC given a list of maize genes. It works by retrieving the genomic coordinates of the iCREs associated with genes of interest, and submitting them to motif enrichment and GRN inference using the genome-wide mode of MINI-AC. iCREs-based MINI-AC can only be run for maize, and not for Arabidopsis. In addition, we offer different sets of iCREs that are used in the run: the "maxF1" (`maxf1`) set or the "all" (`all`) set. The first uses a set of putative CREs that is smaller but more precise (fewer false positives), while the second uses a more comprehensive and complete collection of maize putative CREs.
+
+To download files with the genomic coordinates of the iCREs, the following commands should be executed on the **top-level directory of the repository**:
+
+```shell
+NOT AVAILABLE UNTIL PUBLICATION
+```
+
+To run iCREs-based MINI-AC, the [configuration file](./mini_ac_icres.config) should be prepared as explained [here](./docs/pipeline_configuration.md). Only two parameters change in comparison to the regular MINI-AC runs. Instead of providing a BED file with ACR genomic coordinates, a list of gene IDs from the maize genome version V4 or V5 should be provided, as exemplified [here](./example/inputs/gene_set_files/UP_gene_set.txt). In addition, an iCREs set should be specified (`maxf1` or `all`). Next, the following Nextflow command should be executed:
+
+```shell
+nextflow -C mini_ac_icres.config run mini_ac_icres.nf --icres_set <all|maxf1> --species <maize_v4|maize_v5>
+```
 
 ## Support
 
@@ -81,7 +95,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an
 
 When publishing results generated using MINI-AC, please cite:
 
-Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.
+Nicolás Manosalva Pérez, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal 117, no. 1 (2024): 280–301. https://doi.org/10.1111/tpj.16483.
 
 ## Contact
 

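The iCREs-based command added to the README accepts only two iCREs sets (`all`, `maxf1`) and two species values (`maize_v4`, `maize_v5`). A minimal sketch of a wrapper that checks those values before assembling the command; the helper name `build_icres_command` is hypothetical and not part of MINI-AC, and only the flags shown in the README are used:

```python
import shlex

# Allowed values as documented in the README section on iCREs-based MINI-AC.
ICRES_SETS = {"all", "maxf1"}
SPECIES = {"maize_v4", "maize_v5"}


def build_icres_command(icres_set, species,
                        config="mini_ac_icres.config",
                        script="mini_ac_icres.nf"):
    """Hypothetical helper: validate the options and return the argv list."""
    if icres_set not in ICRES_SETS:
        raise ValueError(f"icres_set must be one of {sorted(ICRES_SETS)}")
    if species not in SPECIES:
        raise ValueError(f"species must be one of {sorted(SPECIES)}")
    return ["nextflow", "-C", config, "run", script,
            "--icres_set", icres_set, "--species", species]


# Print the command instead of running it; pass the list to
# subprocess.run(...) to actually launch Nextflow.
print(shlex.join(build_icres_command("maxf1", "maize_v5")))
```

Rejecting `arabidopsis` up front mirrors the README's statement that iCREs-based MINI-AC can only be run for maize.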
diff --git a/bin/geneList2iCREs.py b/bin/geneList2iCREs.py
@@ -0,0 +1,51 @@
+# %%
+import argparse
+
+def parseArgs():
+
+    parser = argparse.ArgumentParser(description = 'Script to get a BED file with iCREs ' + \
+                                            'coordinates given a list of genes',
+                        conflict_handler='resolve')
+
+    parser.add_argument('annotated_icres', type = str,
+                        help = '',
+                        metavar = 'BED file with 4th column being ' +\
+                                    'an annotated gene ID')
+
+    parser.add_argument('gene_list', type = str,
+                        help = '',
+                        metavar = 'One column file containing gene IDs '+ \
+                                'of interest')
+
+    parser.add_argument('bed_of_genes_icres', type = str,
+                        help = '',
+                        metavar = 'Output BED file with coordinates '+\
+                            'of iCREs associated with genes of interest')
+
+    args = parser.parse_args()
+
+    return args
+
+args = parseArgs()
+
+annot_icres = args.annotated_icres
+genes_oi_file = args.gene_list
+output_file = args.bed_of_genes_icres
+
+# %%
+genes_oi = set()
+
+with open(genes_oi_file, "r") as fin:
+    for line in fin:
+        gene_id = line.strip().split("\t")[0]
+        if gene_id:  # skip blank lines so "" never becomes a gene ID
+            genes_oi.add(gene_id)
+
+with open(output_file, "w") as fout:
+    with open(annot_icres, "r") as fin:
+        for line in fin:
+            rec = line.strip().split("\t")
+            # the 4th column holds the annotated gene ID; skip rows lacking it
+            if len(rec) > 3 and rec[3] in genes_oi:
+                fout.write("\t".join(rec[0:3]))
+                fout.write("\n")
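
The filtering step in `bin/geneList2iCREs.py` can be exercised without touching the file system. A minimal sketch, assuming the same inputs as the script (a one-column gene list and a BED file whose 4th column is the annotated gene ID); the function names and the gene IDs below are hypothetical placeholders, and the sketch skips blank gene-list lines and BED rows with fewer than four columns:

```python
import io


def read_gene_ids(handle):
    """Collect gene IDs from the first tab-separated column, ignoring blanks."""
    genes = set()
    for line in handle:
        gene_id = line.strip().split("\t")[0]
        if gene_id:
            genes.add(gene_id)
    return genes


def filter_icres(bed_handle, genes):
    """Yield 'chrom<TAB>start<TAB>end' for iCREs annotated to a gene of interest."""
    for line in bed_handle:
        rec = line.rstrip("\n").split("\t")
        if len(rec) >= 4 and rec[3] in genes:
            yield "\t".join(rec[0:3])


# Placeholder data: two genes of interest (with a blank line in between) and
# three BED rows, one of which has no gene ID column.
gene_list = io.StringIO("geneA\n\ngeneB\n")
annotated = io.StringIO(
    "chr1\t100\t200\tgeneA\n"
    "chr1\t300\t400\tgeneZ\n"
    "chr2\t10\t20\n"
)

genes = read_gene_ids(gene_list)
print(list(filter_icres(annotated, genes)))  # ['chr1\t100\t200']
```

Working on file-like handles keeps the logic testable with `io.StringIO`, while the same functions accept handles returned by `open(...)`.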
diff --git a/data/icres/.gitkeep b/data/icres/.gitkeep
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -2,7 +2,7 @@
 
 ## Q: MINI-AC failed, how can I fix it?
 A: 
-* Check the [config file](/docs/configuration_pipeline.md):
+* Check the [config file](/docs/pipeline_configuration.md):
   * Did you specify the correct [executor](https://www.nextflow.io/docs/latest/executor.html) (e.g. SGE, SLURM, ...)? Cluster-related options (i.e., all the lines starting with `clusterOptions`) should also be adapted to match the options of the selected executor.
   * Did you [specify to Singularity the path to the temporary directory](https://docs.sylabs.io/guides/3.5/user-guide/bind_paths_and_mounts.html)? This can be done by setting the Singularity parameter ```runOptions``` in Nextflow to ```--bind /absolute/path/to/tmp/folder```. To find the absolute path to the tmp folder on Linux, run ```echo $TMPDIR``` on the command line.