
Simulate linked depth by simulating reads #26

Open · wants to merge 4 commits into main from simreads

Conversation

@MoiColl (Collaborator) commented Jun 20, 2022

No description provided.

@codecov (bot) commented Jun 20, 2022

Codecov Report

Merging #26 (4b64025) into main (efcc36e) will decrease coverage by 5.54%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
- Coverage   72.86%   67.32%   -5.55%     
==========================================
  Files           2        2              
  Lines         129      153      +24     
==========================================
+ Hits           94      103       +9     
- Misses         35       50      +15     
Impacted Files Coverage Δ
simGL/simGL.py 67.10% <44.44%> (-5.56%) ⬇️


simGL/simGL.py Outdated
@@ -55,7 +55,43 @@ def refalt_int_encoding(gm, ref, alt):
refalt_int[refalt_str == "T"] = 3
return refalt_int[gm.reshape(-1), np.repeat(np.arange(gm.shape[0]), gm.shape[1])].reshape(gm.shape)

def sim_allelereadcounts(gm, mean_depth, e, ploidy, seed = None, std_depth = None, ref = None, alt = None):
def linked_depth(rng, DPh, read_length, sites_n):
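Only the signature of the new `linked_depth` helper survives in the diff excerpt above. As context for the review discussion below, here is a minimal reconstruction (mine, not the PR's exact code) consistent with how the reviewers call it: uniform read start positions and an output of shape `(len(DPh), sites_n)`:

```python
import numpy as np

def linked_depth(rng, DPh, read_length, sites_n):
    """Simulate per-site depth for each haplotype from whole reads.

    DPh is an array of target mean depths, one per haplotype.
    Returns an integer array of shape (len(DPh), sites_n).
    """
    # number of reads per haplotype so that reads * read_length / sites_n ~ depth
    read_n = (DPh * sites_n / read_length).astype("int")
    DP = []
    for r in read_n:
        dp = np.zeros(sites_n, dtype=int)
        # uniform read starts; each read covers read_length consecutive sites
        for p in rng.integers(low=0, high=sites_n - read_length + 1, size=r):
            dp[p:p + read_length] += 1
        DP.append(dp)
    return np.array(DP)
```

With `DPh = 5`, `read_length = 100`, and `sites_n = 300`, each haplotype gets exactly 15 reads, so the per-haplotype mean depth is exactly 5 even though the per-site depth is uneven.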
Member commented:
I don't think this is quite right. Try looking at the mean depth as a function of genomic position and you'll see why.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()

[Attached plot: mean depth as a function of genomic position, ramping up and down near the edges]

@MoiColl (Collaborator, Author) replied:

To me, that's correct: it's the effect of reads of a fixed length mapping within a limited region. Because fewer reads span the edges of the sequence, coverage is lower there, and the mean coverage across the rest of the sequence increases to compensate for this edge effect.
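The size of this edge effect can be written down directly: with reads of length r starting uniformly in [0, L - r], the number of start positions whose read covers site p shrinks linearly within r bp of either end, so the interior must sit above the target depth to keep the overall mean at the target. A small sketch (the function name is mine, not from the PR; it assumes the uniform-start model above):

```python
import numpy as np

def expected_depth_profile(mean_depth, read_length, sites_n):
    """Expected depth at each site under uniform read starts.

    A read starting at s covers sites [s, s + read_length). Site p is
    covered by starts in [max(0, p - read_length + 1),
    min(p, sites_n - read_length)], so coverage ramps up linearly over
    the first read_length sites and down over the last read_length.
    """
    n_reads = int(mean_depth * sites_n / read_length)
    n_starts = sites_n - read_length + 1
    p = np.arange(sites_n)
    covering = (np.minimum(p, sites_n - read_length)
                - np.maximum(0, p - read_length + 1) + 1)
    return n_reads * covering / n_starts
```

For `mean_depth=5, read_length=100, sites_n=300` the interior plateau is about 7.46x rather than 5x, while the mean over all sites is still exactly 5, which matches the compensation described above.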

This can be alleviated if the sequence is longer

rng = np.random.default_rng()
DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 30000)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()

[Screenshot: per-site mean depth is essentially flat across the 30,000-site region]

If you calculate the mean coverage per individual, you'll see that it matches the expected value quite well. For example:

rng = np.random.default_rng()
DPh = np.array([5] * 50)  # 50 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
linked.mean(axis = 1)

> array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])

At this point I realized that the dimensions of the returned matrix are reversed, so I need to transpose the output in the original function with return np.array(DP).T. I'll change that later!

Even if the mean coverage input per individual is not exactly 5 (I add some variance by sampling coverage values from a normal distribution with sd = 1), the mean coverage per individual in the output is quite close to the input:


rng = np.random.default_rng()
DPh = rng.normal(loc=5, scale=1.0, size=50)
DPh

> array([5.30418801, 5.24747888, 4.36645145, 3.67576215, 3.75636558,
       4.6398584 , 4.61245799, 5.30115935, 4.57175466, 5.89714514,
       5.62581872, 5.52430759, 5.58910025, 5.42643025, 3.917686  ,
       4.76126296, 3.69152416, 5.16517386, 4.76283951, 5.22580705,
       5.16908739, 4.58332057, 5.6806694 , 6.03509974, 5.98801939,
       4.95102787, 5.80498872, 4.72375886, 5.77273645, 6.67349389,
       4.46026638, 5.43974589, 4.0169985 , 5.08819149, 5.58219182,
       5.0738201 , 4.78218583, 3.85927317, 6.9650079 , 6.02060105,
       4.07702383, 4.27175255, 4.88942867, 5.6765129 , 4.03695654,
       4.62093238, 5.05252965, 4.4394403 , 4.40764166, 4.23782541])

linked = linked_depth(rng, DPh, 100, 30000)
linked.mean(axis = 1)

> array([5.30333333, 5.24666667, 4.36333333, 3.67333333, 3.75333333,
       4.63666667, 4.61      , 5.3       , 4.57      , 5.89666667,
       5.62333333, 5.52333333, 5.58666667, 5.42333333, 3.91666667,
       4.76      , 3.69      , 5.16333333, 4.76      , 5.22333333,
       5.16666667, 4.58      , 5.68      , 6.03333333, 5.98666667,
       4.95      , 5.80333333, 4.72333333, 5.77      , 6.67333333,
       4.46      , 5.43666667, 4.01666667, 5.08666667, 5.58      ,
       5.07333333, 4.78      , 3.85666667, 6.96333333, 6.02      ,
       4.07666667, 4.27      , 4.88666667, 5.67333333, 4.03666667,
       4.62      , 5.05      , 4.43666667, 4.40666667, 4.23666667])

plt.scatter(DPh, linked.mean(axis = 1))
plt.plot(np.arange(2, 10), np.arange(2, 10))  # identity line
plt.xlabel("Input")
plt.ylabel("Output")
plt.show()

[Screenshot: input vs output mean coverage per individual, falling on the identity line]

I guess you could argue that you want to get rid of this edge effect and have the same mean coverage across the whole simulated genomic region. To that end, I modified the original function to:

  1. Simulate a longer genomic region: the original region plus 2 * read length (line 02). This captures the edge effect in "fake" flanking regions that are later removed.
  2. Simulate reads as before (lines 03 to 10).
  3. Trim read_length bp from each edge (line 11). This removes the ramp-up/ramp-down depth pattern observed before; however, the mean coverage of the remaining sequence is then higher than the input.
  4. Correct the coverage per site per individual to the input coverage by subtracting the difference between the mean coverage of the region and the input mean coverage (line 12).
01 def linked_depth(rng, DPh, read_length, sites_n):
02    seq_length = sites_n + (2 * read_length)
03    DP         = []
04    # pad with read_length "fake" sites on each side of the region
05    read_n     = (DPh * seq_length / read_length).astype("int")
06    for r in read_n:
07        dp = np.zeros((seq_length,), dtype=int)
08        for p in rng.integers(low=0, high=seq_length - read_length + 1, size=r):
09            dp[p:p + read_length] += 1
10        DP.append(dp.tolist())
11    DP = (np.array(DP).T)[read_length:-read_length, :]
12    return np.round(DP - (DP.mean(axis=0) - DPh))  # per-individual correction, broadcast over sites
13
14 rng = np.random.default_rng()
15 DPh = np.array([5] * 500)  # 500 haplotypes each with depth 5
16 linked = linked_depth(rng, DPh, 100, 300)

Now the coverage per site is similar across sites in the region of interest

plt.plot(np.mean(linked, axis=1), label="linked")
plt.show()

[Screenshot: per-site mean depth is flat across the trimmed region of interest]

However, because I need to round the coverage per site (coverage must be an integer, not a float), the mean coverage per individual does not match the input:

linked.mean(axis = 1)

> array([5.112, 5.318, 4.896, 4.704, 4.704, 5.89 , 4.682, 4.69 , 5.292,
       4.946, 4.964, 4.352, 4.758, 5.354, 5.34 , 4.75 , 5.152, 4.94 ,
       4.944, 5.146, 4.728, 5.708, 5.72 , 6.31 , 4.302, 5.688, 5.272,
       5.286, 5.896, 5.278, 5.286, 5.064, 4.864, 4.878, 4.256, 4.268,
       4.698, 4.726, 5.336, 4.956, 5.162, 5.372, 4.354, 5.358, 4.862,
       4.34 , 4.33 , 4.922, 5.322, 5.142, 4.948, 5.306, 5.29 , 5.882,
       4.66 , 5.25 , 5.846, 5.268, 6.038, 5.646, 4.254, 5.02 , 5.842,
       5.856, 5.448, 5.046, 5.02 , 5.438, 5.226, 4.628, 4.638, 4.612,
       4.796, 4.79 , 4.788, 4.778, 4.984, 4.556, 4.752, 4.954, 5.766,
       5.156, 4.564, 4.58 , 5.186, 5.398, 4.982, 4.79 , 5.594, 4.582,
       5.196, 4.63 , 5.242, 5.242, 5.23 , 5.212, 5.21 , 5.214, 5.202,
       5.226, 5.236, 5.234, 4.836, 4.856, 4.248, 5.424, 4.25 , 4.846,
       5.05 , 6.032, 4.614, 5.62 , 4.842, 5.264, 4.664, 5.266, 5.264,
       5.258, 5.274, 5.276, 5.684, 4.688, 4.89 , 5.284, 5.888, 5.292,
       5.314, 4.504, 5.478, 4.888, 4.492, 4.714, 5.716, 4.91 , 5.352,
       5.356, 4.758, 5.356, 5.54 , 5.138, 5.336, 4.734, 5.336, 5.332,
       4.732, 4.746, 5.336, 4.728, 4.71 , 4.71 , 6.11 , 5.732, 5.314,
       5.316, 5.318, 5.322, 5.322, 4.72 , 5.35 , 4.748, 4.566, 4.976,
       5.364, 4.554, 4.344, 4.33 , 4.328, 4.306, 5.912, 5.304, 5.306,
       5.334, 5.142, 4.942, 5.118, 4.33 , 5.346, 4.748, 5.152, 4.94 ,
       4.564, 5.57 , 4.752, 4.934, 4.32 , 4.906, 4.324, 5.122, 5.916,
       4.924, 5.114, 5.294, 5.298, 5.888, 4.682, 4.264, 4.468, 5.286,
       4.89 , 4.474, 5.282, 4.682, 5.28 , 4.648, 5.446, 4.848, 4.234,
       5.206, 5.6  , 4.786, 5.806, 4.612, 5.186, 4.162, 4.778, 4.178,
       5.604, 4.63 , 4.808, 4.794, 4.778, 5.392, 4.59 , 4.812, 5.012,
       5.228, 4.388, 5.374, 5.182, 4.782, 5.594, 5.19 , 5.166, 5.17 ,
       4.954, 5.034, 4.524, 5.092, 5.696, 5.09 , 4.692, 4.302, 5.108,
       5.11 , 4.728, 6.108, 4.728, 4.308, 5.122, 5.498, 5.286, 5.074,
       3.69 , 4.704, 5.106, 5.112, 4.908, 4.692, 5.666, 4.474, 5.076,
       5.078, 4.87 , 4.676, 5.694, 5.098, 5.126, 4.74 , 5.554, 5.138,
       5.502, 5.678, 4.086, 4.078, 4.28 , 5.088, 4.468, 4.48 , 4.47 ,
       4.702, 5.062, 5.462, 4.466, 4.87 , 4.086, 4.486, 4.282, 4.704,
       4.678, 4.884, 4.496, 4.894, 5.274, 5.686, 5.486, 4.506, 4.69 ,
       5.272, 4.882, 5.098])

plt.hist(linked.mean(axis = 1))

[Screenshot: histogram of the mean coverage values, spread around 5]
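If the per-individual means had to be preserved exactly in expectation, one hypothetical fix (not part of the PR) would be stochastic rounding of the corrected depths: round each value up with probability equal to its fractional part, so depths stay integers but rounding introduces no bias:

```python
import numpy as np

def stochastic_round(rng, x):
    """Round each value down or up so that E[result] == x elementwise.

    E.g. 4.3 becomes 5 with probability 0.3 and 4 otherwise, so the
    mean depth is unbiased even though every value is an integer.
    """
    floor = np.floor(x)
    frac = x - floor
    return (floor + (rng.random(x.shape) < frac)).astype(int)
```

This trades the deterministic `np.round` on line 12 for unbiasedness; individual realizations still fluctuate, but the bias shown in the histogram above would vanish on average.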

So, if a long sequence is simulated, the edge effect in the former strategy is reduced and might not have a large effect on downstream results. Moreover, that strategy also reflects reality, since you might see the same edge effect in real data. The latter approach removes the edge effect and gives a similar mean coverage across all simulated sites (with stochastic variation), but the mean coverage per individual is not guaranteed to match the input.

Thus, I'd rather stick with the former approach. What do you think? Do you have another idea for getting rid of the edge effect?

@grahamgower (Member) commented Jun 21, 2022
I can imagine circumstances where one would want edge effects, and circumstances where one wouldn't. So stick with your first implementation; just make sure the behaviour is well documented, and it won't be a problem. If edge effects matter to someone, they can always ask for a longer sequence than they need and then truncate the output.

@grahamgower (Member) commented Aug 22, 2024

I resolved the merge conflict here by rebasing against main, and manually editing the conflicting files. To continue with your work here, you'll need to be on your simreads branch and then force pull (git checkout simreads then git pull --force).

(commit) …alization its own function and rearanging the indeces in the GL output in the subset_GL function to correspond to the indeces inputed in the alleles_per_site argument