v2.0.0-beta.3
Changelog
Important fix: Fix splitting duplicated variant IDs across multiple scoring files
Background
- The MATCH_COMBINE step writes new scoring files as input to plink2 --score
- When plink2 encounters the same variant ID on multiple rows of a scoring file, it ignores the duplicates and warns about them
- This only happens when the same variant ID has different effect alleles across different rows
- A variant ID with the same effect allele and scores across multiple columns is OK: this is how scores are calculated in parallel
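The splitting behaviour described above can be sketched in a few lines of shell. This is only an illustration, not the pipeline's MATCH_COMBINE implementation: the function name split_duplicate_ids, the output naming, and the assumed file layout (tab-separated, one header row, variant ID in column 1) are all hypothetical.

```shell
# Hypothetical helper (not part of the pipeline): split a tab-separated scoring
# file so that no output file contains the same variant ID on more than one row.
# Assumed layout: one header row, variant ID in column 1.
split_duplicate_ids() {
    infile=$1   # plain-text scoring file
    prefix=$2   # outputs are named <prefix>_0.tsv, <prefix>_1.tsv, ...
    awk -F'\t' -v prefix="$prefix" '
        NR == 1 { header = $0; next }
        {
            n = seen[$1]++                 # times this ID has appeared before
            out = prefix "_" n ".tsv"      # n-th repeat goes to the n-th file
            if (!(out in started)) {       # first write to a file: add the header
                print header > out
                started[out] = 1
            }
            print > out
        }
    ' "$infile"
}
```

Calling split_duplicate_ids combined.tsv split (both names are placeholders) would write split_0.tsv with the first occurrence of every ID and split_1.tsv with any second occurrences, so each file can be passed to plink2 --score without duplicated IDs.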
Example
When using PGS000039, PGS000040, and PGS000041 in parallel, some variants have different effect alleles at the same coordinates, for example:
- 22:40682469:T:C with effect allele T (PGS000041_hmPOS_GRCh38)
- 22:40682469:T:C with effect allele C (PGS000039_hmPOS_GRCh38)
Impact
In versions v2.0.0-beta, beta.1, and beta.2 the duplicated variant is written to the same scoring file and ignored by plink2. The duplicated variant doesn't contribute to the final calculated PGS.
In all v2.0.0-alpha versions and beta.3 a second scoring file is correctly written containing the other allele (additional alleles create extra scoring files automatically within the updated MATCH_COMBINE process). We have also updated the software tests to ensure this error doesn't occur in future releases.
This problem is more likely to happen when larger scores are calculated in parallel. As more scores are calculated in parallel, it's more likely that variant IDs with different effect alleles will duplicate and be ignored during the score calculation stage.
While the overall impact on the final score is likely to be small, we encourage users to upgrade to beta.3, especially if they calculate larger scores in parallel.
How do I know if my data are affected?
$ cd work/71/35fa3c977993b71d5a85fb6721e8c3 # cd to a scoring process directory
$ comm -3 <(sort hgdp_22_additive_0.sscore.vars) <(zcat hgdp_22_additive_0.scorefile.gz | tail -n +2 | cut -f 1 | sort)
22:40682469:T:C
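In this example one variant appears in the output: it is listed in the scoring file but missing from the variants that were scored, so this run is affected. This check is now included in the scoring module.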
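To run the same comparison over several scoring directories, the check can be wrapped in a small function. This is a sketch under the same assumptions as the command above (a .sscore.vars file with one scored variant ID per line, and a gzipped tab-separated scoring file with a header row and IDs in column 1); the name missing_variants is hypothetical and is not the check shipped in the scoring module.

```shell
# Hypothetical wrapper around the manual check above.
# Prints variant IDs that are in the scoring file but were not scored.
missing_variants() {
    vars_file=$1     # e.g. hgdp_22_additive_0.sscore.vars
    score_file=$2    # e.g. hgdp_22_additive_0.scorefile.gz
    sorted_vars=$(mktemp)
    sort "$vars_file" > "$sorted_vars"
    # comm -13 keeps only lines unique to the second input (read from stdin)
    gzip -dc "$score_file" | tail -n +2 | cut -f 1 | sort | comm -13 "$sorted_vars" -
    rm -f "$sorted_vars"
}
```

An empty result means every variant ID in that scoring file was scored. Because comm compares line by line, an ID that occurs twice in the scoring file but once in the .sscore.vars file is printed once.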
Other fixes
- Fix --keep_ambiguous parameter #346 (@nebfield)
- Fix variant matching information getting dropped from log when scores didn't pass the match rate threshold (@nebfield)
- Fix fraposa-pgsc handling of exclusively numeric IIDs PGScatalog/fraposa_pgsc#18 (@smlmbrt)