
Simulate linked depth by simulating reads #26

Open · wants to merge 4 commits into main from simreads

Conversation

@MoiColl (Collaborator) commented Jun 20, 2022

No description provided.

@codecov (bot) commented Jun 20, 2022

Codecov Report

Merging #26 (4b64025) into main (efcc36e) will decrease coverage by 5.54%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
- Coverage   72.86%   67.32%   -5.55%     
==========================================
  Files           2        2              
  Lines         129      153      +24     
==========================================
+ Hits           94      103       +9     
- Misses         35       50      +15     
Impacted Files Coverage Δ
simGL/simGL.py 67.10% <44.44%> (-5.56%) ⬇️


simGL/simGL.py Outdated
@@ -55,7 +55,43 @@ def refalt_int_encoding(gm, ref, alt):
refalt_int[refalt_str == "T"] = 3
return refalt_int[gm.reshape(-1), np.repeat(np.arange(gm.shape[0]), gm.shape[1])].reshape(gm.shape)

def sim_allelereadcounts(gm, mean_depth, e, ploidy, seed = None, std_depth = None, ref = None, alt = None):
def linked_depth(rng, DPh, read_length, sites_n):
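Only the signature of the new `linked_depth` helper survives in the diff excerpt above. As context for the review discussion below, here is a minimal reconstruction (mine, not the PR's exact code) consistent with how the reviewers call it: uniform read start positions and an output of shape `(len(DPh), sites_n)`:

```python
import numpy as np

def linked_depth(rng, DPh, read_length, sites_n):
    """Simulate per-site depth for each haplotype from whole reads.

    DPh is an array of target mean depths, one per haplotype.
    Returns an integer array of shape (len(DPh), sites_n).
    """
    # number of reads per haplotype so that reads * read_length / sites_n ~ depth
    read_n = (DPh * sites_n / read_length).astype("int")
    DP = []
    for r in read_n:
        dp = np.zeros(sites_n, dtype=int)
        # uniform read starts; each read covers read_length consecutive sites
        for p in rng.integers(low=0, high=sites_n - read_length + 1, size=r):
            dp[p:p + read_length] += 1
        DP.append(dp)
    return np.array(DP)
```

With `DPh = 5`, `read_length = 100`, and `sites_n = 300`, each haplotype gets exactly 15 reads, so the per-haplotype mean depth is exactly 5 even though the per-site depth is uneven.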
Member commented:
I don't think this is quite right. Try looking at the mean depth as a function of genomic position and you'll see why.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()

[Attached plot: mean depth as a function of genomic position, ramping up and down near the edges]

@MoiColl (Collaborator, Author) replied:

To me, that's correct: it's the effect of reads of a fixed length mapping within a limited region. Because fewer reads span the edges of the sequence, coverage is lower there, and the mean coverage across the rest of the sequence increases to compensate for this edge effect.
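The size of this edge effect can be written down directly: with reads of length r starting uniformly in [0, L - r], the number of start positions whose read covers site p shrinks linearly within r bp of either end, so the interior must sit above the target depth to keep the overall mean at the target. A small sketch (the function name is mine, not from the PR; it assumes the uniform-start model above):

```python
import numpy as np

def expected_depth_profile(mean_depth, read_length, sites_n):
    """Expected depth at each site under uniform read starts.

    A read starting at s covers sites [s, s + read_length). Site p is
    covered by starts in [max(0, p - read_length + 1),
    min(p, sites_n - read_length)], so coverage ramps up linearly over
    the first read_length sites and down over the last read_length.
    """
    n_reads = int(mean_depth * sites_n / read_length)
    n_starts = sites_n - read_length + 1
    p = np.arange(sites_n)
    covering = (np.minimum(p, sites_n - read_length)
                - np.maximum(0, p - read_length + 1) + 1)
    return n_reads * covering / n_starts
```

For `mean_depth=5, read_length=100, sites_n=300` the interior plateau is about 7.46x rather than 5x, while the mean over all sites is still exactly 5, which matches the compensation described above.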

This can be alleviated if the sequence is longer

rng = np.random.default_rng()
DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 30000)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()

[Screenshot: per-site mean depth is essentially flat across the 30,000-site region]

If you calculate the mean coverage per individual, you'll see that it matches the expected value quite well. For example:

rng = np.random.default_rng()
DPh = np.array([5] * 50)  # 50 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
linked.mean(axis = 1)

> array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])

At this point I realized that the dimensions of the returned matrix are reversed, so I need to transpose the output in the original function with return np.array(DP).T. I'll change that later!

Even if the mean coverage input per individual is not exactly 5 (I add some variance by sampling coverage values from a normal distribution with sd = 1), the mean coverage per individual in the output is quite close to the input:


rng = np.random.default_rng()
DPh = rng.normal(loc=5, scale=1.0, size=50)
DPh

> array([5.30418801, 5.24747888, 4.36645145, 3.67576215, 3.75636558,
       4.6398584 , 4.61245799, 5.30115935, 4.57175466, 5.89714514,
       5.62581872, 5.52430759, 5.58910025, 5.42643025, 3.917686  ,
       4.76126296, 3.69152416, 5.16517386, 4.76283951, 5.22580705,
       5.16908739, 4.58332057, 5.6806694 , 6.03509974, 5.98801939,
       4.95102787, 5.80498872, 4.72375886, 5.77273645, 6.67349389,
       4.46026638, 5.43974589, 4.0169985 , 5.08819149, 5.58219182,
       5.0738201 , 4.78218583, 3.85927317, 6.9650079 , 6.02060105,
       4.07702383, 4.27175255, 4.88942867, 5.6765129 , 4.03695654,
       4.62093238, 5.05252965, 4.4394403 , 4.40764166, 4.23782541])

linked = linked_depth(rng, DPh, 100, 30000)
linked.mean(axis = 1)

> array([5.30333333, 5.24666667, 4.36333333, 3.67333333, 3.75333333,
       4.63666667, 4.61      , 5.3       , 4.57      , 5.89666667,
       5.62333333, 5.52333333, 5.58666667, 5.42333333, 3.91666667,
       4.76      , 3.69      , 5.16333333, 4.76      , 5.22333333,
       5.16666667, 4.58      , 5.68      , 6.03333333, 5.98666667,
       4.95      , 5.80333333, 4.72333333, 5.77      , 6.67333333,
       4.46      , 5.43666667, 4.01666667, 5.08666667, 5.58      ,
       5.07333333, 4.78      , 3.85666667, 6.96333333, 6.02      ,
       4.07666667, 4.27      , 4.88666667, 5.67333333, 4.03666667,
       4.62      , 5.05      , 4.43666667, 4.40666667, 4.23666667])

plt.scatter(DPh, linked.mean(axis = 1))
plt.plot(np.arange(2, 10), np.arange(2, 10))  # identity line
plt.xlabel("Input")
plt.ylabel("Output")
plt.show()

[Screenshot: input vs output mean coverage per individual, falling on the identity line]

I guess you could argue that you want to get rid of this edge effect and have the same mean coverage across the whole simulated genomic region. To that end, I modified the original function to:

  1. Simulate a longer genomic region: the original region plus 2 * read length (line 02). This captures the edge effect in "fake" flanking regions that are later removed.
  2. Simulate reads as before (lines 03 to 10).
  3. Trim read_length bp from each edge (line 11). This removes the ramp-up/ramp-down depth pattern observed before; however, the mean coverage of the remaining sequence is then higher than the input.
  4. Correct the coverage per site per individual to the input coverage by subtracting the difference between the mean coverage of the region and the input mean coverage (line 12).
01 def linked_depth(rng, DPh, read_length, sites_n):
02    seq_length = sites_n + (2 * read_length)
03    DP         = []
04    # pad with read_length "fake" sites on each side of the region
05    read_n     = (DPh * seq_length / read_length).astype("int")
06    for r in read_n:
07        dp = np.zeros((seq_length,), dtype=int)
08        for p in rng.integers(low=0, high=seq_length - read_length + 1, size=r):
09            dp[p:p + read_length] += 1
10        DP.append(dp.tolist())
11    DP = (np.array(DP).T)[read_length:-read_length, :]
12    return np.round(DP - (DP.mean(axis=0) - DPh))  # per-individual correction, broadcast over sites
13
14 rng = np.random.default_rng()
15 DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
16 linked = linked_depth(rng, DPh, 100, 300)

Now the coverage per site is similar across sites in the region of interest

plt.plot(np.mean(linked, axis=1), label="linked")
plt.show()

[Screenshot: per-site mean depth is flat across the trimmed region of interest]

However, because I need to round the coverage per site (coverage must be an integer, not a float), the mean coverage per individual does not match the input:

linked.mean(axis = 1)

> array([5.112, 5.318, 4.896, 4.704, 4.704, 5.89 , 4.682, 4.69 , 5.292,
       4.946, 4.964, 4.352, 4.758, 5.354, 5.34 , 4.75 , 5.152, 4.94 ,
       4.944, 5.146, 4.728, 5.708, 5.72 , 6.31 , 4.302, 5.688, 5.272,
       5.286, 5.896, 5.278, 5.286, 5.064, 4.864, 4.878, 4.256, 4.268,
       4.698, 4.726, 5.336, 4.956, 5.162, 5.372, 4.354, 5.358, 4.862,
       4.34 , 4.33 , 4.922, 5.322, 5.142, 4.948, 5.306, 5.29 , 5.882,
       4.66 , 5.25 , 5.846, 5.268, 6.038, 5.646, 4.254, 5.02 , 5.842,
       5.856, 5.448, 5.046, 5.02 , 5.438, 5.226, 4.628, 4.638, 4.612,
       4.796, 4.79 , 4.788, 4.778, 4.984, 4.556, 4.752, 4.954, 5.766,
       5.156, 4.564, 4.58 , 5.186, 5.398, 4.982, 4.79 , 5.594, 4.582,
       5.196, 4.63 , 5.242, 5.242, 5.23 , 5.212, 5.21 , 5.214, 5.202,
       5.226, 5.236, 5.234, 4.836, 4.856, 4.248, 5.424, 4.25 , 4.846,
       5.05 , 6.032, 4.614, 5.62 , 4.842, 5.264, 4.664, 5.266, 5.264,
       5.258, 5.274, 5.276, 5.684, 4.688, 4.89 , 5.284, 5.888, 5.292,
       5.314, 4.504, 5.478, 4.888, 4.492, 4.714, 5.716, 4.91 , 5.352,
       5.356, 4.758, 5.356, 5.54 , 5.138, 5.336, 4.734, 5.336, 5.332,
       4.732, 4.746, 5.336, 4.728, 4.71 , 4.71 , 6.11 , 5.732, 5.314,
       5.316, 5.318, 5.322, 5.322, 4.72 , 5.35 , 4.748, 4.566, 4.976,
       5.364, 4.554, 4.344, 4.33 , 4.328, 4.306, 5.912, 5.304, 5.306,
       5.334, 5.142, 4.942, 5.118, 4.33 , 5.346, 4.748, 5.152, 4.94 ,
       4.564, 5.57 , 4.752, 4.934, 4.32 , 4.906, 4.324, 5.122, 5.916,
       4.924, 5.114, 5.294, 5.298, 5.888, 4.682, 4.264, 4.468, 5.286,
       4.89 , 4.474, 5.282, 4.682, 5.28 , 4.648, 5.446, 4.848, 4.234,
       5.206, 5.6  , 4.786, 5.806, 4.612, 5.186, 4.162, 4.778, 4.178,
       5.604, 4.63 , 4.808, 4.794, 4.778, 5.392, 4.59 , 4.812, 5.012,
       5.228, 4.388, 5.374, 5.182, 4.782, 5.594, 5.19 , 5.166, 5.17 ,
       4.954, 5.034, 4.524, 5.092, 5.696, 5.09 , 4.692, 4.302, 5.108,
       5.11 , 4.728, 6.108, 4.728, 4.308, 5.122, 5.498, 5.286, 5.074,
       3.69 , 4.704, 5.106, 5.112, 4.908, 4.692, 5.666, 4.474, 5.076,
       5.078, 4.87 , 4.676, 5.694, 5.098, 5.126, 4.74 , 5.554, 5.138,
       5.502, 5.678, 4.086, 4.078, 4.28 , 5.088, 4.468, 4.48 , 4.47 ,
       4.702, 5.062, 5.462, 4.466, 4.87 , 4.086, 4.486, 4.282, 4.704,
       4.678, 4.884, 4.496, 4.894, 5.274, 5.686, 5.486, 4.506, 4.69 ,
       5.272, 4.882, 5.098])

plt.hist(linked.mean(axis = 1))

[Screenshot: histogram of the mean coverage values, spread around 5]
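If the per-individual means had to be preserved exactly in expectation, one hypothetical fix (not part of the PR) would be stochastic rounding of the corrected depths: round each value up with probability equal to its fractional part, so depths stay integers but rounding introduces no bias:

```python
import numpy as np

def stochastic_round(rng, x):
    """Round each value down or up so that E[result] == x elementwise.

    E.g. 4.3 becomes 5 with probability 0.3 and 4 otherwise, so the
    mean depth is unbiased even though every value is an integer.
    """
    floor = np.floor(x)
    frac = x - floor
    return (floor + (rng.random(x.shape) < frac)).astype(int)
```

This trades the deterministic `np.round` on line 12 for unbiasedness; individual realizations still fluctuate, but the bias shown in the histogram above would vanish on average.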

So, if a long sequence is simulated, the edge effect in the former strategy is reduced and might not have a large effect on downstream results. Moreover, that strategy also reflects reality, since you might see the same edge effect in real data. The latter approach removes the edge effect and gives a similar mean coverage across all simulated sites (with stochastic variation), but the mean coverage per individual is not guaranteed to match the input.

Thus, I'd rather stick with the former approach. What do you think? Do you have another idea for getting rid of the edge effect?

@grahamgower (Member) commented Jun 21, 2022
I can imagine circumstances where one would want edge effects, and circumstances where one wouldn't. So stick with your first implementation; just make sure the behaviour is well documented, and it won't be a problem. If edge effects matter to someone, they can always ask for a longer sequence than they need and then truncate the output.

@grahamgower (Member) commented Aug 22, 2024

I resolved the merge conflict here by rebasing against main, and manually editing the conflicting files. To continue with your work here, you'll need to be on your simreads branch and then force pull (git checkout simreads then git pull --force).

(commit) …alization its own function and rearanging the indeces in the GL output in the subset_GL function to correspond to the indeces inputed in the alleles_per_site argument