-
Notifications
You must be signed in to change notification settings - Fork 1
RNASeq II
Follow up from last time... what happened to your read mapping? Mine was killed due to time. We'll need to give it more time for this lab.
STAR isn't just a mapper, it can also count reads. But we'll need a gtf instead of a gff. (Thank you biostars!)
Lets explore the process of converting a gff to a gtf. (I've already done this in raw_data, you don't have to)
../software/gffread-0.12.7.Linux_x86_64/gffread -E Ppersica_298_v2.1.gene_exons.gff3 -T -o Ppersica_298_v2.1.gene_exons.gtf
Ppersica_298_v2.1.gene_exons.gtf
Ppersica_298_v2.1.gene_exons.gff3
grep -c '\sgene\s' Ppersica_298_v2.1.gene_exons.gff3
grep -c '\smRNA\s' Ppersica_298_v2.1.gene_exons.gff3
Now we need to reindex our genome file to make sure STAR can use the gtf annotations in the mapping and counting steps. here is the new indexing script, which needs to be submitted to a compute node with sbatch
#!/bin/bash
#SBATCH -J fastqc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --mem-per-cpu=16G
module load star
STAR \
--runMode genomeGenerate \
--genomeDir STAR_gtf_idx \
--genomeFastaFiles Ppersica_298_v2.0.fa \
--runThreadN 1 \
--genomeSAindexNbases 11 \
--sjdbGTFfile Ppersica_298_v2.1.gene_exons.gtf \
--sjdbOverhang 149
Note change to gtf file, removal of --sjdbGTFtagExonParentTranscript Parent, and new directory name
Lets return to our analysis directory
/lustre/isaac/proj/UTK0208/rnaseq/analysis/YOURUSERID/2_star
Copy your script from last time and lets see the count output
cp star.qsh starCount.qsh
Edit starCount.qsh:
#!/bin/bash
#SBATCH -J star-map
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 02:00:00
#SBATCH --mem-per-cpu=8G
module load star
time \
STAR \
--genomeDir /lustre/isaac/proj/UTK0208/rnaseq/raw_data/STAR_gtf_idx \
--runThreadN 8 \
--readFilesIn EarlyBlommingRep1_1.fastq.gz EarlyBlommingRep1_2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix EarlyBlommingRep1 \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts
Note new genomeDir location and new quantMode flag, also more cores and time. Submit with sbatch
.
It would be convenient if the computer let us know when the tasks are complete. We can ask it to send us emails for when a job begins, when it ends, and if it fails.
Dummy job, put in email.qsh:
#!/bin/bash
#SBATCH -J email
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:01:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
echo "All done! Send me an email."
We've got EarlyBlooming1 rolling along with no problems (hopefully), but what about the other three? Lets create a task array. Start by putting the forward read file names in a new file:
ls *1.fastq.gz > filenames.txt
You can check it worked with nano or cat. I'm also going to remove early blooming 1 since we've already got it running. Final file should look like this:
EarlyBloomingRep2_1.fastq.gz
LateBloomingRep1_1.fastq.gz
LateBloomingRep2_1.fastq.gz
Now tweaking the submission script. First, we'll use sed to take the number provided from ${SLURM_ARRAY_TASK_ID} and pull out the corresponding line number in filenames.txt. Next, we'll use sed again to pull out the same name (using a regex and the input filename), which will allow us to create the reverse file name and the output file name.
cp starCount.qsh starCountTA.qsh
Now edit starCountTA.qsh
#!/bin/bash
#SBATCH -J star-mapTA
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 02:00:00
#SBATCH --mem-per-cpu=8G
#SBATCH --array=1-3
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
module load star
readonefile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)
echo "readonefile is $readonefile"
readtwofile=$(echo $readonefile | sed 's/_1.fastq.gz/_2.fastq.gz/')
echo "readtwofile is $readtwofile"
outprefix=$(echo $readonefile | sed 's/_1.fastq.gz//')
echo "outprefix is $outprefix"
time \
STAR \
--genomeDir /lustre/isaac/proj/UTK0208/rnaseq/raw_data/STAR_gtf_idx \
--runThreadN 8 \
--readFilesIn $readonefile $readtwofile \
--readFilesCommand zcat \
--outFileNamePrefix $outprefix \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts
Submit and see if it worked.