Ancestral vs. derived allele #585

hammer · 2021-05-27T03:23:06Z

Motivated by @hyanwong at https://github.com/pystatgen/sgkit/discussions/580

Some popgen methods need to know which allele at each site is the ancestral allele (cf. https://biology.stackexchange.com/questions/19159/ancestral-allele-explanation).

We should augment our data model to optionally store allele state.

A verbose approach would be to add a Boolean data variable named variant_allele_ancestral_state along the (variants, alleles) dimensions that would be True when the allele is ancestral and False otherwise.

Given that the alleles dimension has a fixed length which is often > 2, I suppose it’s a design decision to determine how to distinguish a derived allele from a null allele at a site with fewer alleles than the length of the alleles dimension.

This also makes me think we should rename the variants dimension to sites, but that can be addressed in a separate issue.

The text was updated successfully, but these errors were encountered:

eric-czech · 2021-05-27T10:43:42Z

I'd also suggest variant_allele_ancestral_index as a 1D (variants,) array since that would be a very convenient form for fancy indexing into variant_alleles and would also be directly usable for computations on call_genotype (e.g. ancestral allele counts), and it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

hyanwong · 2021-05-27T10:46:31Z

it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

I think you can safely assume that for a given dataset, there is only ever 1 ancestral allele for a site (or it is unknown).

edit - A more sophisticated version could have a probability of each allele being the ancestral one, but I think that's needless complexity for most people.

jeromekelleher · 2021-06-01T18:27:03Z

I'd vote for a variant of @eric-czech's suggestion above: variant_allele_ancestral_state which is an index into the variant_alleles array. There's +/- for explicitly adding the _index suffix, but since users will need to be comfortable with working with indexes in the first place for call_genotype, I think we can leave it out.

hyanwong · 2023-07-03T16:56:14Z

I think this is addressed now in SGkit, isn't it @benjeffery ? So I can close this?

benjeffery · 2023-07-04T10:18:15Z

There is no explicit support, and nothing has been added to the data model as far as I'm aware. I think this stays open for now.

hyanwong mentioned this issue Oct 11, 2022

SampleData frontend for sgkit/xarray dataset tskit-dev/tsinfer#728

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ancestral vs. derived allele #585

Ancestral vs. derived allele #585

hammer commented May 27, 2021 •

edited

Loading

eric-czech commented May 27, 2021

hyanwong commented May 27, 2021 •

edited

Loading

jeromekelleher commented Jun 1, 2021

hyanwong commented Jul 3, 2023

benjeffery commented Jul 4, 2023

Ancestral vs. derived allele #585

Ancestral vs. derived allele #585

Comments

hammer commented May 27, 2021 • edited Loading

eric-czech commented May 27, 2021

hyanwong commented May 27, 2021 • edited Loading

jeromekelleher commented Jun 1, 2021

hyanwong commented Jul 3, 2023

benjeffery commented Jul 4, 2023

hammer commented May 27, 2021 •

edited

Loading

hyanwong commented May 27, 2021 •

edited

Loading