How do I create synthetic VCF data to test the scoring engine across multiple scoring models? #289

tiwalayo · 2024-05-01T15:30:55Z

Suppose there are k unique Scores from the catalog that I want to calculate. Each Score has its own set of variants (and dosages for these variants), and the sets may overlap between Scores. To calculate all k scores, I need a target VCF file that includes the union of the variants from the k Scores.

Is there an easy way to prepare a VCF file that will have all of the required variants so that I can generate the k scores? I'd prefer if a solution resulted in genotypes that are in-distribution with respect to human genetic variation, but this is not a hard requirement since the data is synthetic anyway.

Answered by nebfield

nebfield (Maintainer) · 2024-05-01T15:41:11Z

I'm not an expert (that's @smlmbrt!), but HAPNEST might be helpful to get you started making synthetic genomes:

https://academic.oup.com/bioinformatics/article/39/9/btad535/7255913

https://github.com/intervene-EU-H2020/synthetic_data (there are lots of configuration options here, including limiting generated variants to a specific subset)

It won't output a VCF, but it's easy to create a VCF from HAPNEST output (see plink2 --recode)
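As a rough sketch of that conversion step, the snippet below builds a plink2 command line from Python. It assumes the HAPNEST output is a PLINK 1 binary fileset (.bed/.bim/.fam); the prefix `synthetic_data` and the helper name `plink2_export_vcf_cmd` are hypothetical. It uses plink2's `--export vcf` form, and the optional `--extract` file is assumed to hold one variant ID per line.

```python
import shlex


def plink2_export_vcf_cmd(fileset_prefix, out_prefix, extract_file=None):
    """Build a plink2 command that converts a PLINK 1 binary fileset to VCF.

    fileset_prefix: path prefix of the .bed/.bim/.fam files (assumed input).
    out_prefix: path prefix for the output; plink2 appends the .vcf extension.
    extract_file: optional file of variant IDs to keep, one per line.
    """
    cmd = ["plink2", "--bfile", fileset_prefix, "--export", "vcf", "--out", out_prefix]
    if extract_file is not None:
        # Restrict the export to the listed variant IDs.
        cmd += ["--extract", extract_file]
    return cmd


# Hypothetical prefix; print the command instead of running it.
cmd = plink2_export_vcf_cmd("synthetic_data", "synthetic_data")
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where plink2 is installed.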

smlmbrt May 2, 2024
Maintainer

I think @nebfield's answer is the most correct: you probably want to simulate genomes for all sites in the union of all variants in the PGS, then apply the calculator.
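A minimal sketch of collecting that union of sites, assuming each scoring file is tab-separated with `#` metadata lines at the top and has `chr_name`, `chr_position`, `effect_allele` and `other_allele` columns (files that only list rsIDs would need a different key). The function names and the two in-memory example files are made up for illustration.

```python
import csv
import io


def read_scoring_variants(handle):
    """Yield (chrom, pos, alleles) for each variant row in one scoring file."""
    # Skip '#' metadata lines, then parse the remaining tab-separated table.
    rows = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        alleles = {row["effect_allele"]}
        if row.get("other_allele"):
            alleles.add(row["other_allele"])
        yield row["chr_name"], int(row["chr_position"]), alleles


def union_of_variants(handles):
    """Map (chrom, pos) to the set of alleles seen across all scoring files."""
    sites = {}
    for handle in handles:
        for chrom, pos, alleles in read_scoring_variants(handle):
            sites.setdefault((chrom, pos), set()).update(alleles)
    return sites


# Two made-up scoring files that share one site (1:1000).
header = "chr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\n"
score_a = io.StringIO("#pgs_id=EXAMPLE_A\n" + header + "1\t1000\tA\tG\t0.12\n2\t2000\tC\tT\t-0.05\n")
score_b = io.StringIO("#pgs_id=EXAMPLE_B\n" + header + "1\t1000\tG\tA\t0.30\n3\t3000\tT\tC\t0.08\n")

sites = union_of_variants([score_a, score_b])
for (chrom, pos), alleles in sorted(sites.items()):
    print(chrom, pos, ",".join(sorted(alleles)), sep="\t")
```

Keying on chromosome and position means a site shared by several scores is simulated only once, even when the effect and other alleles are swapped between files. The resulting site list could then be used to restrict which variants the simulator generates.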
