Skip to content

RNASeq II

Meg Staton edited this page Nov 8, 2022 · 8 revisions

Follow up from last time... what happened to your read mapping? Mine was killed due to time. We'll need to give it more time for this lab.

STAR isn't just a mapper, it can also count reads. But we'll need a gtf instead of a gff. (Thank you biostars!)

GFF to GTF and reindex

Lets explore the process of converting a gff to a gtf. (I've already done this in raw_data, you don't have to)

../software/gffread-0.12.7.Linux_x86_64/gffread -E Ppersica_298_v2.1.gene_exons.gff3 -T -o Ppersica_298_v2.1.gene_exons.gtf
Ppersica_298_v2.1.gene_exons.gtf
Ppersica_298_v2.1.gene_exons.gff3
grep -c '\sgene\s' Ppersica_298_v2.1.gene_exons.gff3
grep -c '\smRNA\s' Ppersica_298_v2.1.gene_exons.gff3

Now we need to reindex our genome file to make sure STAR can use the gtf annotations in the mapping and counting steps. here is the new indexing script, which needs to be submitted to a compute node with sbatch

#!/bin/bash
#SBATCH -J fastqc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --mem-per-cpu=16G

module load star

STAR \
--runMode genomeGenerate \
--genomeDir STAR_gtf_idx \
--genomeFastaFiles Ppersica_298_v2.0.fa \
--runThreadN 1 \
--genomeSAindexNbases 11 \
--sjdbGTFfile Ppersica_298_v2.1.gene_exons.gtf \
--sjdbOverhang 149

Note change to gtf file, removal of --sjdbGTFtagExonParentTranscript Parent, and new directory name

STAR - map and count one paired-end read set

Lets return to our analysis directory

/lustre/isaac/proj/UTK0208/rnaseq/analysis/YOURUSERID/2_star

Copy your script from last time and lets see the count output

cp star.qsh starCount.qsh

Edit starCount.qsh:

#!/bin/bash
#SBATCH -J star-map
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 02:00:00
#SBATCH --mem-per-cpu=8G

module load star

time \
STAR \
--genomeDir /lustre/isaac/proj/UTK0208/rnaseq/raw_data/STAR_gtf_idx \
--runThreadN 8 \
--readFilesIn EarlyBlommingRep1_1.fastq.gz EarlyBlommingRep1_2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix EarlyBlommingRep1 \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts

Note new genomeDir location and new quantMode flag, also more cores and time. Submit with sbatch.

Email!

It would be convenient if the computer let us know when the tasks are complete. We can ask it to send us emails for when a job begins, when it ends, and if it fails.

Dummy job, put in email.qsh:

#!/bin/bash
#SBATCH -J email
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:01:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

echo "All done! Send me an email."

Task Arrays

We've got EarlyBlooming1 rolling along with no problems (hopefully), but what about the other three? Lets create a task array. Start by putting the forward read file names in a new file:

ls *1.fastq.gz > filenames.txt

You can check it worked with nano or cat. I'm also going to remove early blooming 1 since we've already got it running. Final file should look like this:

EarlyBloomingRep2_1.fastq.gz
LateBloomingRep1_1.fastq.gz
LateBloomingRep2_1.fastq.gz

Now tweaking the submission script. First, we'll use sed to take the number provided from ${SLURM_ARRAY_TASK_ID} and pull out the corresponding line number in filenames.txt. Next, we'll use sed again to pull out the same name (using a regex and the input filename), which will allow us to create the reverse file name and the output file name.

cp starCount.qsh starCountTA.qsh

Now edit starCountTA.qsh


#!/bin/bash
#SBATCH -J star-mapTA
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH -A ISAAC-UTK0208
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 02:00:00
#SBATCH --mem-per-cpu=8G
#SBATCH --array=1-3
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

module load star

readonefile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)
echo "readonefile is $readonefile"

readtwofile=$(echo $readonefile | sed 's/_1.fastq.gz/_2.fastq.gz/')
echo "readtwofile is $readtwofile"

outprefix=$(echo $readonefile | sed 's/_1.fastq.gz//')
echo "outprefix is $outprefix"

time \
STAR \
--genomeDir /lustre/isaac/proj/UTK0208/rnaseq/raw_data/STAR_gtf_idx \
--runThreadN 8 \
--readFilesIn $readonefile $readtwofile \
--readFilesCommand zcat \
--outFileNamePrefix $outprefix \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts

Submit and see if it worked.