Releases: PGScatalog/pgsc_calc
v2.0.0
We've marked this release as the first full release of v2, linked with our recent publication describing the calculator in full (Lambert, Wingfield, et al. Nature Genetics. 2024.).
Changelog
Improvements
- Make report shareable by default
- Remove individual level data
- Don't show density plots with small sample sizes
- Add warnings about complex alleles (e.g. HLA/APOE) and dosage specific effect weights to the report
- The variant verification step added in 2.0.0-beta.3 has been integrated into pgscatalog-aggregate
  - The symmetric difference of scoring file variant IDs (.scorefile.gz) and variants that contributed to the final calculated score (.vars plink file) must be an empty set
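The verification rule above can be sketched in Python. This is a minimal illustration, not the code used by pgscatalog-aggregate: all function names are hypothetical, and the file readers assume a gzipped, tab-separated scoring file with a header row and variant IDs in the first column, plus a .vars file with one variant ID per line (the same layout the shell check in the v2.0.0-beta.3 notes relies on).

```python
import gzip


def read_scorefile_ids(path):
    """Variant IDs from column 1 of a gzipped, tab-separated scoring file."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # assumption: the first line is a header
        return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


def read_vars_ids(path):
    """Variant IDs from a plink .vars file (assumed: one ID per line)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def unverified_variants(scorefile_ids, scored_ids):
    """Sorted symmetric difference: IDs present in only one of the two sets."""
    return sorted(set(scorefile_ids) ^ set(scored_ids))


def verify(scorefile_ids, scored_ids):
    """Raise if a variant is in the scoring file but was not scored, or vice versa."""
    difference = unverified_variants(scorefile_ids, scored_ids)
    if difference:
        raise ValueError(
            f"{len(difference)} variant(s) failed verification: {difference}"
        )
```

Calling `verify(read_scorefile_ids(...), read_vars_ids(...))` on a scoring process directory would pass silently only when both files list exactly the same variant IDs.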
Bug fixes
- Stop crashing when encountering a scoring file with dosage specific effect weights (skip instead)
- Fix report logo
- Fix VCF input with JSON samplesheets
- Add tar to zstd conda environment to prevent very old tar installs failing to extract the database
- Reduce download max thread workers to prevent throttling by the EBI
- Fix variant verification step failing in some conda deployments
Full Changelog: v2.0.0-beta.3...v2.0.0
Note
- The COMBINE_SCOREFILES process may take longer to finish and use more memory than in previous versions
- New internal variant data models were added in this release to improve handling of complex alleles and dosage specific effect weights
- Also, every variant (scoring file row) has many more validation steps now to ensure data quality and consistency
- Speed and memory usage will be improved in the next release
v2.0.0-beta.3
Changelog
Important fix: Fix splitting duplicated variant IDs across multiple scoring files
Background
- The MATCH_COMBINE step writes new scoring files for input to plink2 --score
- When plink2 encounters a variant with the same ID across multiple rows in a scoring file it will ignore duplicates and warn about them
- This only happens when the same variant ID has different effect alleles across different rows
- A variant ID with the same effect allele and scores across multiple columns is OK; this causes scores to be calculated in parallel
Example
When using PGS000039, PGS000040, and PGS000041 in parallel some variants have different effect alleles at the same coordinates, for example:
- 22:40682469:T:C with effect allele T (PGS000041_hmPOS_GRCh38)
- 22:40682469:T:C with effect allele C (PGS000039_hmPOS_GRCh38)
Impact
In versions v2.0.0-beta, beta.1, and beta.2 the duplicated variant is written to the same scoring file and ignored by plink2. The duplicated variant doesn't contribute to the final calculated PGS.
In all v2.0.0-alpha versions and beta.3 a second scoring file is correctly written containing the other allele (additional alleles create extra scoring files automatically within the updated MATCH_COMBINE process). We have also updated the software tests to ensure this error doesn't occur in future releases.
This problem is more likely to happen when larger scores are calculated in parallel. As more scores are calculated in parallel, it's more likely that variant IDs with different effect alleles will duplicate and be ignored during the score calculation stage.
While the overall impact on the final score is likely to be small we encourage users to upgrade to beta.3, especially if they calculate larger scores in parallel.
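The splitting behaviour described in this section can be sketched as follows. This is an illustration under stated assumptions, not the MATCH_COMBINE implementation: the function name `split_scoring_rows` and the tuple layout are hypothetical. Rows that share a variant ID and effect allele are merged into one row with several score columns, while each additional effect allele for the same variant ID is pushed into a further scoring file, so no file contains a duplicated ID.

```python
from collections import defaultdict


def split_scoring_rows(rows):
    """Split rows into scoring files with unique variant IDs.

    rows: iterable of (variant_id, effect_allele, score_name, weight).
    Returns a list of files; each file maps
    variant_id -> (effect_allele, {score_name: weight}).
    """
    # Same ID and same effect allele: one row, many score columns.
    merged = defaultdict(dict)
    for variant_id, effect_allele, score_name, weight in rows:
        merged[(variant_id, effect_allele)][score_name] = weight

    files = []
    alleles_seen = defaultdict(int)  # effect alleles already placed per variant ID
    for (variant_id, effect_allele), weights in merged.items():
        index = alleles_seen[variant_id]
        alleles_seen[variant_id] += 1
        if index == len(files):
            files.append({})  # a new effect allele needs an extra scoring file
        files[index][variant_id] = (effect_allele, weights)
    return files


# Effect alleles are taken from the example above; the weights are invented
# purely for illustration.
rows = [
    ("22:40682469:T:C", "T", "PGS000041", 0.1),
    ("22:40682469:T:C", "C", "PGS000039", 0.2),
]
files = split_scoring_rows(rows)
```

With these two rows, `files` holds two scoring files: the first carries the T allele for PGS000041 and the second carries the C allele for PGS000039.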
How do I know if my data are affected?
$ cd work/71/35fa3c977993b71d5a85fb6721e8c3 # cd to a scoring process directory
$ comm -3 <(sort hgdp_22_additive_0.sscore.vars) <(zcat hgdp_22_additive_0.scorefile.gz | tail -n +2 | cut -f 1 | sort)
22:40682469:T:C
One missing variant appears in the output. This check is now included in the scoring module.
Other fixes
- Fix --keep_ambiguous parameter #346 (@nebfield)
- Fix variant matching information getting dropped from log when scores didn't pass the match rate threshold (@nebfield)
- Fix fraposa-pgsc handling exclusively numeric IIDs PGScatalog/fraposa_pgsc#18 (@smlmbrt)
v2.0.0-beta.2
Changelog
Features
- Add FID support internally (FID + IID must be unique for all samples) [@nebfield, thanks to @jasamack for initial draft fix]
- Add parameters to tune target variant missingness (--pca_geno_miss_target, default maximum 10%) and/or MAF (--pca_maf_target, default no filtering) during intersection with the reference panel. [@smlmbrt]
  - The new defaults will help prevent incorrect ancestry assignments when running the calculator on low sample sizes (reverting to pre-beta version behaviour), as these incorrect assignments were previously caused by the MAF filter.
- Add --efo_id parameter, deprecating --trait_efo, which will be removed in a future release
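The missingness and MAF thresholds introduced above can be illustrated with a small sketch. It is not how pgsc_calc or plink2 applies the filters: the function name `variant_passes` is hypothetical, genotypes are assumed to be alternate allele counts (0, 1, 2) with `None` for missing calls, and treating both thresholds as inclusive is an assumption. The defaults mirror the release notes (at most 10% missingness, no MAF filtering).

```python
def variant_passes(genotypes, max_missing=0.1, min_maf=0.0):
    """Return True if a target variant passes the missingness and MAF thresholds.

    genotypes: alternate allele counts per sample (0, 1, 2), None when missing.
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_total == 0 or not called:
        return False  # nothing to estimate frequencies from

    missingness = (n_total - len(called)) / n_total
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return missingness <= max_missing and maf >= min_maf


# With only three samples a variant can easily look monomorphic (MAF = 0).
small_sample = [0, 0, 0]
kept_by_default = variant_passes(small_sample)                  # no MAF filter
dropped_with_maf = variant_passes(small_sample, min_maf=0.05)   # MAF filter on
```

The last two lines show why a MAF filter is risky with few samples: a variant that is common in the reference panel can appear monomorphic in a small target set and be removed.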
Misc
- Remove default anaconda channels because of license changes #342
v2.0.0-beta.1
Changelog
Bug fixes
- Fix samplesheet parsing error warnings by @smlmbrt in #322
- Write consistent column sets to variant information files by @nebfield in #330
Full Changelog: v2.0.0-beta...v2.0.0-beta.1
v2.0.0-beta
Changelog
Graduating to beta with the release of our preprint 🎉
Improvements
- Improve aggregation PGScatalog/pygscatalog#23
- Improve matching performance PGScatalog/pygscatalog#22
- Improve match error docs #311
- Publish dependencies to Bioconda to improve conda profile UX
Bug fixes
- Fix for PGScatalog/pygscatalog#21
- Closes #301
- Specify modules explicitly to fix #312
- Fix bim input to pgscatalog-aggregate #319
pgsc_calc v2.0.0-alpha.6
Changelog
2024-05-28 update: We're investigating unexpected pgscatalog.core.lib.pgsexceptions.MatchRateError in some environments (e.g. UK Biobank on a HPC). This release has been downgraded to a pre-release.
Please note the minimum required nextflow version has been updated to v23.10.0, released in October 2023. Run nextflow self-update to upgrade your nextflow version.
Improvements
- Migrate our custom python tools to new pygscatalog packages
  - Reference / target intersection now considers allelic frequency and variant missingness to determine PCA eligibility
  - Downloads from PGS Catalog should be faster (async)
  - Packages are now documented
- Update plink version to alpha 5.10 final #179
- Add docs describing cloud execution
- Add correlation test comparing calculated scores against known good scores
- When matching variants, matching logs are now written before scorefiles to improve debugging UX
- Improvements to PCA quality (ensuring low missingness and suitable MAF for PCA-eligible variants in target samples).
- This could allow us to implement MAF/missingness filters for scoring file variants in the future.
Bug fixes
- Fix ancestry adjustment with VCFs #252
- Fix support for scoring files that only have one effect type column #280
- Fix adjusting PGS with zero variance (skip them) #283
- Check for reserved characters in sampleset names
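The zero-variance fix above (#283) addresses a general problem: any adjustment that divides by a standard deviation is undefined when every sample has the same score. A minimal sketch of the "skip them" approach follows. Plain standardisation stands in here for the calculator's ancestry adjustment, which is not reproduced; the function name `standardise_scores` and the score names are hypothetical.

```python
from statistics import mean, pstdev


def standardise_scores(scores):
    """Standardise each score, skipping scores with zero variance.

    scores: maps a score name to a list of per-sample values.
    Returns (adjusted, skipped), where skipped lists the zero-variance scores.
    """
    adjusted, skipped = {}, []
    for name, values in scores.items():
        sd = pstdev(values)
        if sd == 0:
            skipped.append(name)  # dividing by zero would fail, so skip instead
            continue
        mu = mean(values)
        adjusted[name] = [(value - mu) / sd for value in values]
    return adjusted, skipped


# Hypothetical inputs: PGS_B is identical for every sample.
adjusted, skipped = standardise_scores(
    {"PGS_A": [1.0, 2.0, 3.0], "PGS_B": [0.5, 0.5, 0.5]}
)
```

Here `skipped` contains only PGS_B, and PGS_A is adjusted as usual.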
Known bug
- Incorrectly adjusting the AVG in --run_ancestry mode #301
- unexpected pgscatalog.core.lib.pgsexceptions.MatchRateError in some environments (e.g. UK Biobank on a HPC)
pgsc_calc v2.0.0-alpha.5
Changelog
Improvements
- Automatically mount directories inside singularity containers without setting any configuration
- Improve permanent caching of ancestry processes with --genotypes_cache parameter
- resync with nf-core framework
- Refactor combine_scorefiles to improve speed and quality control processes
Bug fixes
- Fix semantic storeDir definitions causing problems with cloud execution (Google Batch)
- Fix missing DENOM values with multiple custom scoring files (score calculation not affected)
- Fix liftover failing silently with custom scoring files (thanks Brooke!)
Misc:
- Move aggregation step out of report
- Improve speed of ANCESTRY_ANALYSIS
pgsc_calc v2.0.0-alpha.4
Changelog
Improvements
- Give a more helpful error message when there are no valid matches in match_combine
Bug fixes
- Fix retrying downloads when the EBI servers are sleepy on a Monday morning
- Fix numeric sample identifiers breaking ancestry analysis
- Check chr prefix in samplesheets
pgsc_calc v2.0.0-alpha.3
Improvements:
- Automatically retry scoring with more RAM on larger datasets
- Describe scoring precision in docs
- Change handling of VCFs to reduce errors when recoding
- Internal changes to improve support for custom reference panels
Bug fixes:
- Fix VCF input to ancestry projection subworkflow (thanks frahimov and AWS-crafter for patiently debugging)
- Fix scoring options when reading allelic frequencies from a reference panel (thanks raimondsre for reporting the changes from v1.3.2 -> 2.0.0-alpha)
- Fix conda profile action
pgsc_calc v2.0.0-alpha.2
Changelog
- Bump pgscatalog_utils v0.4.0 -> v0.4.1
  - Closes #165