DNASeqI Practical

Log in to the ACF and go to your personal project folder in the class directory.

Change made on 9/23/18

After class was over I realized that fastq-dump produces misformatted files (this became obvious during read mapping, when BWA couldn't identify proper pairs). This lesson has been updated to reflect the new way of getting the files, and the project folder has been renamed to e_coli_fixed.

Set Up and Get Data

Start a project directory (inside your personal project folder)

mkdir e_coli_fixed
cd e_coli_fixed

Download data from ENA

mkdir raw_data
cd raw_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR021/DRR021342/DRR021342_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR021/DRR021342/DRR021342_2.fastq.gz

Get the reference genome as well (from NCBI this time), which we will use to map the reads against

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
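Interrupted downloads can leave truncated .gz files behind. One possible check, sketched here on a made-up file, is gzip -t, which tests the integrity of a compressed file without unpacking it. The file name demo.fna.gz and the scratch directory are assumptions for illustration only; in practice you would loop over the .gz files in raw_data.

```shell
# Work in a scratch directory so nothing in raw_data is touched.
tmpdir=$(mktemp -d)

# Hypothetical small compressed FASTA file standing in for a real download.
printf '>demo\nACGT\n' | gzip > "$tmpdir/demo.fna.gz"

# gzip -t exits with status 0 if the archive is intact, non-zero otherwise.
status=ok
for f in "$tmpdir"/*.gz; do
    if gzip -t "$f"; then
        echo "OK:      $f"
    else
        echo "CORRUPT: $f"
        status=corrupt
    fi
done
```

If a file is reported as corrupt, downloading it again is usually the simplest fix.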

Interactive session and examining files

Use qsub to get an interactive session.

qsub -I -A ACF-UTK0085 -q debug -l walltime=1:00:00,nodes=1

Check that you are on a compute node.

uname -a

Get back to your directory (e_coli_fixed/raw_data) and examine the files.

Unzip the ones that end in .gz (the command is gunzip)
Look at the number of lines
Look at the (human readable) file size
Try out head and tail
For the sequence files, how do you figure out how many sequences are in each file?
- For fasta, grep -c '^>' <filename>
- For fastq, count the number of lines and divide by 4
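The two counting approaches above can be tried on tiny made-up files, where the expected answer is easy to verify by eye. The file names demo.fastq and demo.fna and their contents are assumptions for illustration only; substitute the real files in raw_data. Dividing by 4 assumes every fastq record occupies exactly four lines, that is, sequences are not wrapped across lines.

```shell
# Work in a scratch directory so nothing in raw_data is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Hypothetical fastq file with two records (4 lines per record).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > demo.fastq
# Hypothetical fasta file with three records.
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n' > demo.fna

# fastq: total line count divided by 4.
fastq_lines=$(wc -l < demo.fastq)
fastq_seqs=$((fastq_lines / 4))

# fasta: count header lines, which start with '>'.
fasta_seqs=$(grep -c '^>' demo.fna)

echo "fastq sequences: $fastq_seqs"
echo "fasta sequences: $fasta_seqs"
```

On the real data, the two paired fastq files should report the same number of sequences.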

If this were actual data you didn't want to lose, how would you remove write permissions? Give it a try and then try to delete the data.
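One possible approach, shown on a throwaway file rather than the real data, is chmod a-w, which removes write permission for the owner, group, and everyone else. The file name demo_reads.fastq is a made-up placeholder.

```shell
# Work in a scratch directory so nothing in raw_data is touched.
tmpdir=$(mktemp -d)

# Hypothetical file standing in for data you want to protect.
touch "$tmpdir/demo_reads.fastq"

# Remove write permission for user, group, and others.
chmod a-w "$tmpdir/demo_reads.fastq"

# The first ten characters of the ls -l output are the permission string;
# it should no longer contain a 'w'.
perms=$(ls -l "$tmpdir/demo_reads.fastq" | cut -c1-10)
echo "$perms"
```

When run interactively, rm without -f asks for confirmation before deleting a write-protected file. Whether the deletion is allowed at all depends on the permissions of the containing directory, not on those of the file.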

Quality Examination

We can use FastQC to generate a report summarizing many quality statistics. First, let's keep our directory organized and set up a sensible structure.

cd ..
mkdir analysis
cd analysis
mkdir 1_fastqc
cd 1_fastqc

Load the software

module load fastqc

And run it

fastqc -t 2 -o . ../../raw_data/DRR021342_1.fastq
fastqc -t 2 -o . ../../raw_data/DRR021342_2.fastq

The -t flag specifies number of threads and the -o flag specifies where to put the output files.

Copy the FastQC results to your computer with scp so you can check the quality. Here's an example of an scp command, to be run from a terminal on your computer (not from the ACF terminal). Replace both instances of username with your own username.

scp [email protected]:/lustre/haven/courses/EPP622-2018Fa/username/e_coli_fixed/analysis/1_fastqc/*html .

Open these files and see what kind of quality we have.

More info about random hexamer priming not being so random
