-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draw error rate from a distribution #21
Comments
Do we have any information on how those errors are distributed? |
It depends on the sequencing platform. I did something similar a few years ago: https://github.com/grahamgower/PP5mC#quality-score-profiles |
For ancient DNA, the base quality scores may have been "recalibrated" using, e.g. mapDamage, to account for possible deamination at the terminal ends of reads. I'm not sure how we could incorporate this nicely without simulating the reads in conjunction with a putative ancestral genome sequence (so as to identify where possible C->T or G->A substitutions could be). But this is something that empirical datasets do, so its worth considering how we could close the gap between a simulated training dataset and the empirical dataset(s) the trained model will be applied to. |
I was considering doing something similar to what I'm doing with
But I don't think it makes sense to sample error probabilities from a normal distribution. Furthermore, the errors for each nucleotide have their own biases (C>T deamination). |
How about using a discrete distribution for the error distribution? This is what we get in practice with PHRED scores in a fastq file, so it should match up pretty well with empirical data. |
Ok! I did some analysis to check how de quality scores for each base pair of reads are distributed in real data and how are they correlated. 1. Empirical read base-pair quality distributionsI download different fastq files from different samples
with the command
Then, with
And obtained the desired information from the file
To read such a file I created a
I then obtained all the distributions for each fastq with
And plot those distributions in R
We can see from this figure that the shape of the distributions look quite alike regardless of the coverage and age of the sample, but with different mean and variance values. This suggests to me that if we could find a parametric distribution that behaved like this, we could change the mean and variance values and draw quality values from such distribution. Otherwise, I also imagine the option to input an empirical distribution (fastqc text file output) and use that to sample quality values per base pair. If the user does not specify anything regarding the quality values, then by default we could use one of the HGDP fastq files quality distributions. A final option that the user might also have is to just introduce a quality value which will be constant for all nucleotides simulated. 2. AutocorrelationI took the file produced from the cram file as explained in #22, and first checked the histogram of the distribution
It looks different from the previous distributions, probably because I took a subset of a few reads. Then, I computed the difference in score between consecutive positions and plotted the difference as a measure of autocorrelation of the quality score between adjacent bases. Here is the plot of this value per site in the genome
The distribution of the value
And some summary statistics
In order to compare to what would happen if there was no autocorrelation, I permuted the quality column from the dataset and recalculated the statistic.
The variance is smaller in the real dataset, pointing towards autocorrelation. Furthermore, I did 10,000 permutations and checked the variance value for each permutation and then compared it to the real value. None of the permutations had a value equal to or more extreme than the real one. Thus, it is very likely that the quality scores of a read are autocorrelated. However, since this would be more complicated to simulate than autocorrelation for coverage #22, and might not be so critical for simulating Genotype Likelihoods, for now, I won't simulate such autocorrelation. 3. Differences between bam and fastq quality scoresI've taken the reads in the original cram file and created a fastq file out of it with
I didn't let the whole thing to be processed since it is a file very big. Thus, because I cancelled the job midway, I had to remove a few lines from the end of the file. Then, I read the data with the previous function for the new
and plot
We can see that the cram quality distributions have changed quite a lot: quality scores are lower in general and the distribution looks more like bimodal. In the case of |
With your last plot, which colour is the mapped data and which is the unmapped? The mapped data should be a proper subset of the unmapped data, right? |
Red is the mapped. I thought the same as you, but then, could it be that there is some recalibration of the quality scores after mapping? or that the quality value scales have been reformated to a different Phred scale? |
I don't know, but there's something funny going on there. |
E.g. from an empirical distribution observed for a given sequencing platform. These can then be returned to the user in the form of Phred-scaled base quality scores, which some genotype likelihood models use explicitly.
The text was updated successfully, but these errors were encountered: