DNASeqI Practical
Log in to the ACF and go to your personal project folder in the class directory.
After class I realized that fastq-dump had produced misformatted files (this became obvious during read mapping, when BWA couldn't identify proper pairs). This lesson has therefore been updated to download the files from ENA instead, and the project folder has been renamed to e_coli_fixed.
Start a project directory (inside your home directory)
mkdir e_coli_fixed
cd e_coli_fixed
Download data from ENA
mkdir raw_data
cd raw_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR021/DRR021342/DRR021342_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR021/DRR021342/DRR021342_2.fastq.gz
Get the reference genome as well (from NCBI this time), which we will use to map the reads against
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
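Large downloads occasionally get cut off, so it can be worth checking that each .gz file is intact before moving on. Here is a minimal sketch using gzip -t on toy data. The scratch directory and the toy file names are made up for illustration; on the real data you would point gzip -t at the files you just downloaded.

```shell
# Scratch directory with a small compressed file (hypothetical stand-in for a download).
workdir=$(mktemp -d)
printf '@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' | gzip -c > "$workdir/toy.fastq.gz"

# Simulate an interrupted download by keeping only the first 15 bytes.
head -c 15 "$workdir/toy.fastq.gz" > "$workdir/truncated.fastq.gz"

# gzip -t tests integrity without writing any output; exit status 0 means the file is intact.
for f in "$workdir/toy.fastq.gz" "$workdir/truncated.fastq.gz"; do
    if gzip -t "$f" 2>/dev/null; then
        echo "OK      $(basename "$f")"
    else
        echo "BROKEN  $(basename "$f")"
    fi
done

# Remove the scratch directory when you are done: rm -r "$workdir"
```

If a real download is reported as broken, the simplest fix is usually to delete it and run wget again.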
Use qsub to get an interactive session.
qsub -I -A ACF-UTK0085 -q debug -l walltime=1:00:00,nodes=1
Check that you are on a compute node.
uname -a
Get back to your directory (e_coli_fixed/raw_data) and examine the files.
- Unzip the ones that end in .gz (the command is gunzip)
- Look at the number of lines
- Look at the (human readable) file size
- Try out head and tail
- For the sequence files, how do you figure out how many sequences are in each file?
- For fasta, count the header lines:
grep -c '^>' <filename>
- For fastq, count the number of lines and divide by 4
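The checks in the list above can be tried on toy data first. This is a sketch under stated assumptions: the scratch directory and the files toy.fasta and toy.fastq are invented for illustration, so swap in the real file names from raw_data.

```shell
# Build two tiny example files in a scratch directory (hypothetical names).
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/toy.fasta"
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$workdir/toy.fastq"

# Human-readable file sizes and a peek at the first fastq record.
ls -lh "$workdir"
head -n 4 "$workdir/toy.fastq"

# FASTA: one header line starting with '>' per sequence.
fasta_count=$(grep -c '^>' "$workdir/toy.fasta")

# FASTQ: four lines per record, so divide the line count by 4.
fastq_lines=$(wc -l < "$workdir/toy.fastq")
fastq_count=$((fastq_lines / 4))

echo "fasta sequences: $fasta_count"
echo "fastq reads: $fastq_count"

# Clean up the scratch directory.
rm -r "$workdir"
```

Counting lines is safer for fastq than grepping for '@', because '@' is also a valid quality character and can appear at the start of a quality line.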
If this were actual data you didn't want to lose, how would you remove write permissions? Give it a try and then try to delete the data.
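One possible way to approach this is with chmod. The sketch below works on a throwaway file (precious.fasta is a made-up name) so nothing real is at risk. The starting permissions are set explicitly so the result does not depend on your umask.

```shell
# Throwaway file in a scratch directory (hypothetical name).
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/precious.fasta"
chmod 644 "$workdir/precious.fasta"

# Remove write permission for user, group, and others.
chmod a-w "$workdir/precious.fasta"

# The first 10 characters of ls -l show the file type and permission bits.
perms=$(ls -l "$workdir/precious.fasta" | cut -c1-10)
echo "$perms"

# Cleanup: rm -rf still removes the file because the directory itself is writable.
rm -rf "$workdir"
```

In an interactive session a plain rm asks for confirmation before deleting a write-protected file, which is the safety net here. Note that rm -f skips that prompt, and whether deletion is allowed at all depends on write permission on the containing directory, not on the file.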
We can use FastQC to generate a report summarizing many quality statistics. First, let's keep our directory organized and set up a sensible structure.
cd ..
mkdir analysis
cd analysis
mkdir 1_fastqc
cd 1_fastqc
Load the software
module load fastqc
And run it
fastqc -t 2 -o . ../../raw_data/DRR021342_1.fastq
fastqc -t 2 -o . ../../raw_data/DRR021342_2.fastq
The -t flag specifies the number of threads and the -o flag specifies where to put the output files.
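The two commands differ only in the read number, so they could also be generated with a loop. The sketch below is a dry run: it only prints the commands, so it is safe to try anywhere. The variable name commands is just for illustration.

```shell
# Dry run: collect the fastqc commands in a variable and print them instead of running them.
commands=$(
    for read in 1 2; do
        echo "fastqc -t 2 -o . ../../raw_data/DRR021342_${read}.fastq"
    done
)
echo "$commands"
```

Once the printed commands look right, drop the echo and the quotes inside the loop to actually run fastqc (after module load fastqc).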
Use scp to copy the FastQC results to your computer so you can check the quality. Here's an example of an scp command, to be run from a terminal on your computer (not from the ACF terminal). Replace both instances of username with your own username.
scp [email protected]:/lustre/haven/courses/EPP622-2018Fa/username/e_coli_fixed/analysis/1_fastqc/*html .
Open these files and see what kind of quality we have.