Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancestral vs. derived allele #585

Open
hammer opened this issue May 27, 2021 · 5 comments
Open

Ancestral vs. derived allele #585

hammer opened this issue May 27, 2021 · 5 comments

Comments

@hammer
Copy link
Contributor

hammer commented May 27, 2021

Motivated by @hyanwong at https://github.com/pystatgen/sgkit/discussions/580

Some popgen methods need to know which allele at each site is the ancestral allele (cf. https://biology.stackexchange.com/questions/19159/ancestral-allele-explanation).

We should augment our data model to optionally store allele state.

A verbose approach would be to add a Boolean data variable named variant_allele_ancestral_state along the (variants, alleles) dimensions that would be True when the allele is ancestral and False otherwise.

Given that the alleles dimension has a fixed length which is often > 2, I suppose it’s a design decision to determine how to distinguish a derived allele from a null allele at a site with fewer alleles than the length of the alleles dimension.

This also makes me think we should rename the variants dimension to sites, but that can be addressed in a separate issue.

@eric-czech
Copy link
Collaborator

I'd also suggest variant_allele_ancestral_index as a 1D (variants,) array since that would be a very convenient form for fancy indexing into variant_alleles and would also be directly usable for computations on call_genotype (e.g. ancestral allele counts), and it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

@hyanwong
Copy link

hyanwong commented May 27, 2021

it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

I think you can safely assume that for a given dataset, there is only ever 1 ancestral allele for a site (or it is unknown).

edit - A more sophisticated version could have a probability of each allele being the ancestral one, but I think that's needless complexity for most people.

@jeromekelleher
Copy link
Collaborator

I'd vote for a variant of @eric-czech's suggestion above: variant_allele_ancestral_state which is an index into the variant_alleles array. There's +/- for explicitly adding the _index suffix, but since users will need to be comfortable with working with indexes in the first place for call_genotype, I think we can leave it out.

@hyanwong
Copy link

hyanwong commented Jul 3, 2023

I think this is addressed now in SGkit, isn't it @benjeffery ? So I can close this?

@benjeffery
Copy link
Collaborator

There is no explicit support, and nothing has been added to the data model as far as I'm aware. I think this stays open for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants