- ChIA-DropBox is currently under review and will be opened to the public after our paper is published.
- For testing purposes, users should download the code directly from the browser after logging in with the provided ID and password.
- "git clone" and "wget" will work once ChIA-DropBox is open to the public.
- We have added scripts for handling GEO SRRxxx FASTQ files.
ChIA-DropBox: a novel analysis and visualization pipeline for multiplex chromatin interactions
Simon Zhongyuan Tian, Daniel Capurso, Minji Kim, Byoungkoo Lee, Meizhen Zheng, Yijun Ruan
Recently, we developed ChIA-Drop [1], a novel experimental method for detecting multiplex chromatin interactions with single-molecule precision via droplet-based and barcode-linked sequencing. ChIA-DropBox is a novel toolkit for analyzing and visualizing multiplex chromatin interactions. It includes the ChIA-DropBox data processing pipeline [2] and the visualization tool ChIA-View [3].
[1] Meizhen Zheng, Simon Zhongyuan Tian, Daniel Capurso, Minji Kim, Rahul Maurya, Byoungkoo Lee, Emaly Piecuch et al. "Multiplex chromatin interactions with single-molecule precision." Nature 566, 558 (2019). GEO accession: GSE109355
[2] https://github.com/TheJacksonLaboratory/ChIA-DropBox.git
[3] https://github.com/TheJacksonLaboratory/ChIA-view.git
This pipeline was developed and run on CentOS 6.5 in an HPC (high-performance computing) environment.
Download the 10x Genomics longranger pipeline and install it according to its instructions. We have tested longranger v1.x, which can be downloaded from: https://support.10xgenomics.com/genome-exome/software/downloads/1.x/
Before running the ChIA-DropBox pipeline, prepare the genome reference according to the 10x Genomics guide. ChIA-DropBox currently supports the following genome versions: dm3, dm6, hg19, hg38, mm9, mm10.
mkdir reference_genome/
cd reference_genome/
ls
refdata-dm3/
refdata-hg19-2.1.0/
refdata-GRCh38-2.1.0/
...
We suggest three data sources for users. Testing data (MiSeq): small datasets for testing the pipeline. ChIA-Drop production data: the data released with our Nature paper. User data: data generated by users themselves.
RNAP2 ChIA-Drop (SHG0051), ChIA-Drop (SHG0061), and Pure-DNA ChIA-Drop (SHG0041)
are available at https://www.dropbox.com/sh/klb1hwprl66qced/AAAGEE9e5SumVEK31jCN4NTKa?dl=0
Please rename the SRRxxx FASTQ files to the following pseudo-names, whose format is required by the longranger pipeline:
RNAP2 ChIA-Drop: SHG0051H @ SRR7722067
cd MCP.FASTQ/
mv SRR7722067.1.fastq SHG0051H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq.gz
mv SRR7722067.2.fastq SHG0051H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq.gz
ChIA-Drop: SHG0061H @ SRR7722059
cd MCP.FASTQ/
mv SRR7722059.1.fastq SHG0061H_GT18-08817_SI-GA-B10_S28_L006_R1_001.fastq.gz
mv SRR7722059.2.fastq SHG0061H_GT18-08817_SI-GA-B10_S28_L006_R2_001.fastq.gz
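The renaming pattern above can be sketched as a small helper. This is only an illustration: the function name `srr_rename_cmds` and its two arguments are hypothetical and not part of ChIA-DropBox. It prints the `mv` commands (a dry run) so they can be checked before anything is renamed.

```shell
#!/bin/sh
# Hypothetical helper (not part of ChIA-DropBox): print the two mv commands
# for one SRR run without renaming anything.
#   $1 = SRR accession, e.g. SRR7722059
#   $2 = longranger-style prefix up to and including the lane,
#        e.g. SHG0061H_GT18-08817_SI-GA-B10_S28_L006
srr_rename_cmds() {
    srr_acc="$1"
    srr_prefix="$2"
    for srr_mate in 1 2; do
        # mate 1 becomes R1, mate 2 becomes R2
        printf 'mv %s.%s.fastq %s_R%s_001.fastq.gz\n' \
            "$srr_acc" "$srr_mate" "$srr_prefix" "$srr_mate"
    done
}

srr_rename_cmds SRR7722059 SHG0061H_GT18-08817_SI-GA-B10_S28_L006
```

Once the printed commands look right, they could be run from inside MCP.FASTQ/ by piping the output to `sh`.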
If you want to generate the FASTQ files yourself, please use longranger mkfastq.
MCP.FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
MCP.FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
The following tools and packages are required by the ChIA-DropBox data processing pipeline.
module load longranger/2.1.5   # or a later version
module load python/2.7.13      # packages: os, sys, subprocess, operator, collections, time, pysam, pybedtools, glob
module load samtools/1.8
module load java/1.8.0
module load R/3.4.4            # packages: dplyr, ggplot2, gridExtra, scales, sitools
module load BEDtools/2.26.0
module load ImageMagick/7.0.7-26
module load juicer/1.7.5
# set path variables for juicer:
juicer1="/opt/compsci/juicer/1.7.5/CPU/common/"
juicer2="/opt/compsci/juicer/1.7.5/"
The ChIA-DropBox pipeline automatically processes data according to the "library name". After completing the preparation above, please create a folder on your HPC system. The folder name must strictly follow this convention:
- If the library was sequenced on an Illumina MiSeq, the folder is named with 3 letters + 4 digits (e.g. SHG0024)
- If the library was sequenced on an Illumina HiSeq, the folder is named with 3 letters + 4 digits + "H" (e.g. SHG0024H)
- If the library was sequenced on an Illumina NextSeq, the folder is named with 3 letters + 4 digits + "N" (e.g. SHG0024N)
- If the library was sequenced on an Illumina NovaSeq, the folder is named with 3 letters + 4 digits + "V" (e.g. SHG0024V)
For example, for the example FASTQ files above (1.3), we should create a folder named SHG0055H.
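The naming convention can be checked before creating the folder. The following is a minimal sketch, not a script shipped with ChIA-DropBox: the function name `lib_platform` is hypothetical, and accepting lowercase letters in the three-letter prefix is an assumption.

```shell
#!/bin/sh
# Hypothetical check (not part of ChIA-DropBox): map a library folder name
# to the sequencer implied by its suffix, or reject it.
lib_platform() {
    case "$1" in
        [A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9])  echo "MiSeq" ;;
        [A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9]H) echo "HiSeq" ;;
        [A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9]N) echo "NextSeq" ;;
        [A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9]V) echo "NovaSeq" ;;
        *) echo "invalid"; return 1 ;;   # does not follow the convention
    esac
}

lib_platform SHG0055H
```

A name such as SHG55H (too few digits) makes the function print "invalid" and return a non-zero status, so it can guard a `mkdir`.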
Download the ChIA-DropBox pipeline from GitHub to your HPC system and copy it into the library folder created in 2.1 (e.g. copy all files into SHG0055H/).
For example:
git clone https://github.com/TheJacksonLaboratory/ChIA-DropBox.git
ls ChIA-dropbox-v1.0/
mkdir SHG0055H
cd SHG0055H/
cp -rf ../ChIA-dropbox-v1.0/MCP* .
## FASTQ
cd SHG0055H/MCP.FASTQ/
#!! please copy fastq file here :
cp /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
cp /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq
## or soft link:
ln -s /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
ln -s /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq
## Reference genome
cd SHG0055H/
ln -s /projects/ruan-lab/Sequence_pool/10X/REFGNM10X/ MCP.REF
## detail see: 1.2 Prepare reference genome
## FASTQ type (input FASTQ file type: downloaded from GEO as SRR, or generated by longranger mkfastq)
## This script converts the names of FASTQ files obtained from SRR archives.
cd SHG0055H/
sh MCP_RUN00_prepare_data.sh
#This data is download from GEO (SRR...) or NOT? (SRR/NOT):
SRR
ls MCP.FASTQ/*
MCP.FASTQ/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
MCP.FASTQ/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq
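Before continuing, it may help to confirm that every R1 file has its R2 mate. The following is a minimal sketch under stated assumptions: the function name `missing_mates` is hypothetical, and it assumes the `_R1_001.fastq` / `_R2_001.fastq` naming shown above (with or without a `.gz` suffix).

```shell
#!/bin/sh
# Hypothetical check (not part of ChIA-DropBox): read FASTQ file names on
# stdin and report every R1 file whose R2 mate is not in the same list.
missing_mates() {
    mm_names="$(cat)"
    printf '%s\n' "$mm_names" | grep '_R1_001\.fastq' | while IFS= read -r mm_r1; do
        # derive the expected R2 name from the R1 name
        mm_r2="$(printf '%s\n' "$mm_r1" | sed 's/_R1_001\.fastq/_R2_001.fastq/')"
        printf '%s\n' "$mm_names" | grep -qxF "$mm_r2" \
            || printf 'missing mate for %s\n' "$mm_r1"
    done
}

# Prints nothing when all pairs are complete (or when MCP.FASTQ/ is absent).
ls MCP.FASTQ/ 2>/dev/null | missing_mates
```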
We provide a script called MCP_RUN01_initiate_parameters.sh that sets up all the parameters required by the ChIA-DropBox pipeline. Simply answer each question according to its recommended answer, as in the following example.
$ cd /path-to-data-processing/SHG0055H/
$ sh MCP_RUN01_initiate_parameters.sh
################
Library name: SHG0055H
~~~~~~~~~~
SHG0055H
################
qsub ID: 1055
~~~~~~~~~~
1055
##################
1055
Longranger .mro LIB-ID (must be all digits): 1055
1055
~~~~~~~~~~
1055
##################
folder of fastq: /path-to-data-processing/SHG0055H/FASTQ/
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
~~~~~~~~~~
/path-to-data-processing/SHG0055H/FASTQ/
SHG0055H_GT18-19704_SI-GA-H11_S22_L001_I1_001.fastq.gz SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
###################
EXAMPLE: SHG0066_GT18-06593_SI-GA-A1_S5_L001_I1_001.fastq.gz
fastq prefix: SHG0055H_GT18-19704_SI-GA-H11
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
~~~~~~~~~~
SHG0055H_GT18-19704_SI-GA-H11
###################
cpu number required: 16
~~~~~~~~~~
16
###################
memory size required: 150gb
~~~~~~~~~~
150gb
###################
pipeline running time requierd: 25
~~~~~~~~~~
25
###################
fragment extention size (0, 3000, 5000 ...): 3000
~~~~~~~~~~
3000
###################
execute mark: 20190116-142816
~~~~~~~~~~
20190116-142816
###################
cell gender (m=male; f=female) : m
~~~~~~~~~~
m
###################
genome reference (hg19, hg38, mm9, mm10, dm3, dm6, dm3hg19, hg19dm3) : dm3
genome size file: /path-to-data-processing/SHG0055H/MCP.pc//dm3_len_6CHROM.txt
~~~~~~~~~~
/path-to-data-processing/SHG0055H/MCP.pc//dm3_len_6CHROM.txt
###################
/path-to-data-processing/SHG0055H/REF/refdata-dm3
genome reference directory: /path-to-data-processing/SHG0055H/REF/refdata-dm3
~~~~~~~~~~
/path-to-data-processing/SHG0055H/REF/refdata-dm3
###################
After running MCP_RUN01_initiate_parameters.sh, two new configuration files are created for this library: MCP.conf and MCP.mro.
MCP_RUN02_exec_pipeline_from_to.sh is the script that selects the pipeline steps to run and generates the qsub file (HPC job file).
$ sh MCP_RUN02_exec_pipeline_from_to.sh
CDB0 = STEP0000_MAPPING.run
CDB1 = STEP0001_MKDOO.run
CDB2 = STEP0101_GETBC.run
...
CDB37 = STEP2001_plot.sh
CDB38 = STEP2002_combine_image.run
from
0
TO
38
SHG0055H
B/S (B: big resource is needed; S: small resource is needed)
B
#! /bin/bash
#PBS -l nodes=1:ppn=16,mem=150gb,walltime=25:00:00
#PBS -N SHG0055H.0.38.dm3
cd $PBS_O_WORKDIR
FROM=0
TO=38
OUTSIDE=`pwd`
source ${OUTSIDE}/MCP.conf
source ${OUTSIDE}/MCP.module
MCPPC.PART.0-38.qsub
qsub MCPPC.PART.0-38.qsub
This script (MCP_RUN02_exec_pipeline_from_to.sh) creates a qsub file, MCPPC.PART.0-38.qsub, which runs the pipeline from the very beginning (CDB0 = STEP0000_MAPPING.run) to the end (CDB38 = STEP2002_combine_image.run). We can also select a single step or a range of steps to execute by choosing FROM and TO:
$ sh MCP_RUN02_exec_pipeline_from_to.sh
CDB0 = STEP0000_MAPPING.run
CDB1 = STEP0001_MKDOO.run
...
CDB37 = STEP2001_plot.sh
CDB38 = STEP2002_combine_image.run
from
12
TO
23
SHG0055H
B/S (Big=this library defined qsub resource needed; Small=default qsub)
S
#! /bin/bash
#PBS
#PBS -N SHG0055H.12.23.dm3
cd $PBS_O_WORKDIR
FROM=12
TO=23
OUTSIDE=`pwd`
source ${OUTSIDE}/MCP.conf
source ${OUTSIDE}/MCP.module
MCPPC.PART.12-23.qsub
qsub MCPPC.PART.12-23.qsub
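The two runs above suggest how the qsub file name and the PBS job name are built from the library name, the FROM and TO steps, and the genome. The sketch below only reproduces that naming pattern as it appears in the listings; the function `qsub_names` is hypothetical, and the range check (0 to 38) is an assumption based on the step list shown, not the logic of MCP_RUN02_exec_pipeline_from_to.sh itself.

```shell
#!/bin/sh
# Hypothetical illustration (not the pipeline's own code): derive the qsub
# file name and job name for a step range, as seen in the listings above.
#   $1 = library name, $2 = FROM step, $3 = TO step, $4 = genome
qsub_names() {
    qn_lib="$1"; qn_from="$2"; qn_to="$3"; qn_genome="$4"
    if [ "$qn_from" -lt 0 ] || [ "$qn_to" -gt 38 ] || [ "$qn_from" -gt "$qn_to" ]; then
        echo "step range must satisfy 0 <= FROM <= TO <= 38" >&2
        return 1
    fi
    printf 'file=MCPPC.PART.%s-%s.qsub\n' "$qn_from" "$qn_to"
    printf 'job=%s.%s.%s.%s\n' "$qn_lib" "$qn_from" "$qn_to" "$qn_genome"
}

qsub_names SHG0055H 12 23 dm3
```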
Using the command generated in 2.3.3, we can submit the job to the HPC:
$ qsub MCPPC.PART.12-23.qsub
$ qstat -u stian
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
9093578.helix-master stian batch SHG0055H.12.23.dm 7638 1 1 -- 01:00:00 R 00:00:02
$ qsub MCPPC.PART.0-38.qsub
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
9093579.helix-master stian batch SHG0055H.0.38.dm3 -- 1 16 150gb 25:00:00 Q --
Typically, for a library of 10 M reads, the qsub job takes about 5 hours to run.
- /path-to-data-processing/SHG0055H/SHG0055H.0-38.20190114-122447.log: records the log of all steps.
- /path-to-data-processing/SHG0055H/SHG0055H.0.38.dm3.o9090159: reports qsub execution information.
ChIA-DropBox generates statistics in /path-to-data-processing/SHG0055H/SHG0055H/D12_REPORT/.
$ cat /path-to-data-processing/SHG0055H/SHG0055H/D12_REPORT/SHG0017_sta.csv
SHG0055H lib
0.010103762232363806 pcr_duplication
8142864 Total_PET_(Paired_end_reads)
2073187 PET_with_BC_(BarCode)
1704288 Uniq_Mappable_reads (R1)
267914 Reads_mapq>=30
97141 Reads_q>=30_&_len>=50bp
92071 Reads_q>=30_&_len>=50bp_&_3'ext500
92071 Reads_q>=30_&_len>=50bp_&_3'ext500_&_MAJOR_CHROM
304151 BC_count_of_PET-w/-BC_(BarCode)
291621 BC_of_Uniq_Mappable_reads_(R1)
43743 BC_of_Reads_q>=30_&_len>=50bp_&_3'ext500_&_MAJOR_CHROM
43743 GEM_TOTAL
41562 GEM_singleFrag
2181 GEM_multiFrags
296 GEM_of_intra_chrom
1885 GEM_of_inter_chrom
574 GEM_of_inter_chrom_divby_chrom_has_multiFrags
5751 GEM_of_inter_chrom_divby_chrom_has_singleFrag
48183 GEM_of_TOTAL_SA_[SingleFrag+IntraFrags]
47313 GEM_with_Frags=1
779 GEM_with_Frags=2
72 GEM_with_Frags=3
14 GEM_with_Frags=4
4 GEM_with_Frags=5
...
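Individual statistics can be combined into summary ratios, for example the fraction of GEMs that carry more than one fragment. The following is a minimal sketch: the function name `multi_frag_fraction` is hypothetical, and it assumes each line is a value followed by whitespace and a label, as in the listing above. If the real file is comma-separated, the field separator would need to change.

```shell
#!/bin/sh
# Hypothetical helper (not part of ChIA-DropBox): compute
# GEM_multiFrags / GEM_TOTAL from a statistics report.
# Assumes "value<whitespace>label" lines as shown in the listing above.
multi_frag_fraction() {
    awk '$2 == "GEM_TOTAL"      { total = $1 }
         $2 == "GEM_multiFrags" { multi = $1 }
         END {
             if (total > 0) printf "%.4f\n", multi / total
             else print "NA"
         }' "$@"
}

# Two lines copied from the example report above; prints 0.0499.
multi_frag_fraction <<'EOF'
43743 GEM_TOTAL
2181 GEM_multiFrags
EOF
```

The function also accepts a file path as its argument instead of reading standard input.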
ChIA-DropBox produces output files in multiple formats to meet different requirements.
- .PlinePgemSimp --- One GEM per line, with fragment coordinates.
GEM_ID | GEM_coordinate | GEM_span | Fragment_number | List_of_fragment_coordinates(read count/fragment) |
---|---|---|---|---|
SHG0055H-10000046-AAACACCAGTAACGATBX1-FA-1-0 | chrX:16274973-16286464 | 11491 | 2 | chrX:16274973-16275587(0);chrX:16285880-16286464(1) |
SHG0055H-10000116-AAACACCGTACAGAGCBX1-FA-1-0 | chrX:10798609-18731966 | 7933357 | 2 | chrX:10798609-10799202(0);chrX:18731338-18731966(1) |
- .region(.rds) --- used in ChIA-View to show the linear alignment and clustering views of multiple fragments. In this file, each line represents one fragment; fragments from the same complex share the same GEM_ID.
Chrom | Start | End | Frag_num/GEM | GEM_ID |
---|---|---|---|---|
chrX | 16274973 | 16275587 | 2 | SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 |
- .region.PEanno(.rds) --- used in ChIA-View to show the linear alignment and clustering views of multiple fragments, with fragments annotated as promoter or non-promoter.
Chrom | Start | End | Frag_num/GEM | GEM_ID | PEanno |
---|---|---|---|---|---|
chrX | 16274973 | 16275587 | 2 | SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 | P |
chrX | 16285880 | 16286464 | 2 | SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 | E |
- .gff --- view multiple fragments linearly in IGV as a pseudo gene track (large data may slow IGV down).
- .hic --- 2D heatmap file for Juicebox
- .bedpe_PlinePpair --- Converts all fragments with the same GEMcode into paired-end interactions (PETs).
Chrom1 | Start1 | End1 | Chrom2 | Start2 | End2 | Frag_num/GEM | GEM_ID |
---|---|---|---|---|---|---|---|
chrX | 16274973 | 16275587 | chrX | 16285880 | 16286464 | 2 | SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 |
chrX | 10798609 | 10799202 | chrX | 18731338 | 18731966 | 2 | SHG0055H-1000-10000116-AAACACCGTACAGAGCBX1-M02838-FA-1-0 |
- .rcov.bedgraph --- Coverage file generated from qualified reads
- .fcov.bedgraph --- Coverage file generated from fragments
- .acov.bedgraph --- Coverage file generated from PET anchors (without removing duplicated anchors).
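Two of the formats above can be illustrated with short awk sketches. Both function names (`check_gem_span`, `region_to_pairs`) are hypothetical and not part of ChIA-DropBox, and both assume tab-separated columns in the order shown in the tables. The first checks that GEM_span equals end minus start of GEM_coordinate in a .PlinePgemSimp file, which holds for the example rows above. The second expands the fragments of each GEM in a .region file into pairs; emitting every pairwise combination is an assumption about how .bedpe_PlinePpair is built, although it does reproduce the two-fragment example row.

```shell
#!/bin/sh
# Hypothetical sketches; assume tab-separated input with the column order
# shown in the tables above.

# .PlinePgemSimp: report whether GEM_span == end - start of GEM_coordinate.
check_gem_span() {
    awk -F '\t' '{
        split($2, c, /[:-]/)     # chrX:16274973-16286464 -> chrX, start, end
        print $1 "\t" ((c[3] - c[2] == $3) ? "ok" : "mismatch")
    }' "$@"
}

# .region: emit all pairwise fragment combinations per GEM_ID (assumption).
region_to_pairs() {
    awk -F '\t' -v OFS='\t' '
    {
        gem = $5
        if (!(gem in n)) order[++ngem] = gem   # keep first-seen GEM order
        k = ++n[gem]
        frag[gem, k] = $1 OFS $2 OFS $3
        fnum[gem] = $4
    }
    END {
        for (g = 1; g <= ngem; g++) {
            gem = order[g]
            for (i = 1; i < n[gem]; i++)
                for (j = i + 1; j <= n[gem]; j++)
                    print frag[gem, i], frag[gem, j], fnum[gem], gem
        }
    }' "$@"
}

printf 'SHG0055H-10000046-AAACACCAGTAACGATBX1-FA-1-0\tchrX:16274973-16286464\t11491\t2\tchrX:16274973-16275587(0);chrX:16285880-16286464(1)\n' \
    | check_gem_span

printf 'chrX\t16274973\t16275587\t2\tGEM_A\nchrX\t16285880\t16286464\t2\tGEM_A\n' \
    | region_to_pairs
```

Under the all-pairs assumption, a GEM with n fragments yields n(n-1)/2 pairs, so the number of output lines grows quickly for GEMs with many fragments.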
ChIA-DropBox exports QC plots here:
/path-to-data-processing/SHG0055H/SHG0055H/D20_PLOT/SHG0055H.QCPLOT.AIN1.png