Suppose there are k unique Scores from the catalog that I want to generate scores from. Each Score has an associated set of variants (and dosages for these variants), and the variants may overlap between Scores. To create all k scores, I need a target VCF file that includes the union of the variants across the k Scores. Is there an easy way to prepare a VCF file containing all of the required variants? I'd prefer a solution that produces genotypes that are in-distribution with respect to human genetic variation, but this is not a hard requirement since the data is synthetic anyway.
-
I'm not an expert (that's @smlmbrt!), but HAPNEST might help you get started making synthetic genomes:
https://academic.oup.com/bioinformatics/article/39/9/btad535/7255913
https://github.com/intervene-EU-H2020/synthetic_data
There are lots of configuration options, including limiting the generated variants to a specific subset.
It won't output a VCF, but it's easy to create one from HAPNEST output (see plink2 --recode).
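
For the "union of variants" part of the question, here is a rough Python sketch of one way to build the variant list to restrict generation to. It assumes each scoring file is tab-separated, has `#`-prefixed metadata lines at the top, and has `chr_name` and `chr_position` columns; check your files, since some only list rsIDs or use harmonized column names. The function names and the example data are made up for illustration, and how you feed the resulting list to the generator depends on its configuration.

```python
import csv
import io


def read_score_variants(lines):
    """Yield (chrom, pos) for each variant row of one scoring file.

    Assumes '#' metadata lines followed by a tab-separated table with
    'chr_name' and 'chr_position' columns (an assumption, see above).
    """
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        chrom, pos = row.get("chr_name"), row.get("chr_position")
        if chrom and pos:  # skip rows without coordinates
            yield chrom, int(pos)


def union_of_variants(score_files):
    """Return the sorted union of variant positions across scoring files."""
    variants = set()
    for lines in score_files:
        variants.update(read_score_variants(lines))

    def sort_key(variant):
        chrom, pos = variant
        # Numeric chromosomes first in numeric order, then X, Y, MT, ...
        return (not chrom.isdigit(), int(chrom) if chrom.isdigit() else 0, chrom, pos)

    return sorted(variants, key=sort_key)


# Two tiny made-up scoring files that share one variant (1:1000).
score_a = (
    "#example=A\n"
    "chr_name\tchr_position\teffect_allele\teffect_weight\n"
    "1\t1000\tA\t0.1\n"
    "2\t500\tG\t-0.2\n"
)
score_b = (
    "#example=B\n"
    "chr_name\tchr_position\teffect_allele\teffect_weight\n"
    "1\t1000\tA\t0.3\n"
    "X\t42\tT\t0.05\n"
)

union = union_of_variants([io.StringIO(score_a), io.StringIO(score_b)])

# One "chrom<TAB>pos" line per variant, ready to write to a file.
variant_list = "\n".join(f"{chrom}\t{pos}" for chrom, pos in union)
print(variant_list)
```

To use real files, pass open file handles instead of the `io.StringIO` objects. Matching on position alone ignores alleles and genome build, so make sure all k scoring files use the same build before merging.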