Remove dataset wrappers in prototype core.py #39

Open
eric-czech opened this issue Jun 16, 2020 · 0 comments

We would like to move towards an API of only functions that act on or create Xarray datasets. The wrapper classes in core.py should be removed and the conversion functions in them moved elsewhere.
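
As a rough sketch of what a functions-only API could look like, the snippet below builds a plain Xarray dataset from raw arrays and then derives a new variable from it with an ordinary function. The function names, the dimension names (`variants`, `samples`, `ploidy`), the variable names, and the convention that negative genotype values mean "missing" are all assumptions made for illustration, not decisions that have been made in the project.

```python
import numpy as np
import xarray as xr


def create_genotype_call_dataset(gt, contig, pos):
    """Build a plain xarray.Dataset from raw arrays (hypothetical helper)."""
    return xr.Dataset(
        {
            # Dimension and variable names here are illustrative assumptions
            "GT": (("variants", "samples", "ploidy"), np.asarray(gt)),
            "contig": (("variants",), np.asarray(contig)),
            "pos": (("variants",), np.asarray(pos)),
        }
    )


def count_alt_alleles(ds):
    """Return a new dataset with per-call alternate allele counts added."""
    gt = ds["GT"]
    # Assumed convention: any negative allele value marks the call as missing
    is_called = (gt >= 0).all(dim="ploidy")
    # Count non-reference alleles, leaving NaN where the call is missing
    dosage = (gt > 0).sum(dim="ploidy").where(is_called)
    return ds.assign(dosage=dosage)
```

The input dataset is never modified. `count_alt_alleles` returns a new dataset, so there is no wrapper state to keep in sync.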

A few problems that the wrappers were, at least in part, intended to solve are:

  • What conventions should I/O readers adhere to when building datasets?
    • Should they have default coordinates? This helps facilitate indexing/selecting data but it could be left up to the user.
    • Should the strategy for representing missing values in floating point data be the same as the strategy used by readers of integer types? If we go the scikit-allel sentinel + boolean mask route then probably not, but if we use masked Dask arrays then it likely makes more sense for the reader to be responsible for creating them.
    • How should we represent phased genotypes? As far as I've seen, phasing could be specific to only variants, variants + samples, or an entire dataset so it may make sense for readers to return a 1D array, a 2D array, or global attributes (whatever is most appropriate).
  • How do we assert dimensions and dtypes on datasets? Maybe we shouldn't do this at all, or there could be functions that do this at the beginning of method functions (like scikit-learn).
  • How do we standardize naming conventions for fields like contig, pos, alleles, GT, etc.?
  • How do we make it clear which kinds of datasets can be converted to others? It is probably best to have functions for things like computing dosages, hard calls, GWAS encodings, allele counts etc. that take arrays/datasets and leave it up to users not to pass them anything that doesn't make sense.
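
To make two of the questions above concrete, here is a minimal sketch of (a) a validation function that a method could call on entry to check dimensions and dtypes, and (b) the sentinel + boolean mask approach to missing values. The spec format, the function names, the `GT_mask` variable, and the `-1` sentinel are hypothetical choices for illustration only.

```python
import numpy as np
import xarray as xr

# Hypothetical spec: variable name -> (expected dims, expected NumPy dtype kind)
GENOTYPE_SPEC = {
    "GT": (("variants", "samples", "ploidy"), "i"),
    "pos": (("variants",), "i"),
}


def check_dataset(ds, spec):
    """Raise if a variable is missing or has unexpected dims or dtype kind."""
    for name, (dims, kind) in spec.items():
        if name not in ds:
            raise ValueError(f"missing variable: {name}")
        if ds[name].dims != dims:
            raise ValueError(f"{name}: expected dims {dims}, got {ds[name].dims}")
        if ds[name].dtype.kind != kind:
            raise TypeError(
                f"{name}: expected dtype kind {kind!r}, got {ds[name].dtype.kind!r}"
            )
    return ds


def add_missing_mask(ds, sentinel=-1):
    """Sentinel + boolean mask: flag alleles equal to the assumed sentinel."""
    # Validate on entry, similar in spirit to input checks in scikit-learn
    check_dataset(ds, GENOTYPE_SPEC)
    return ds.assign(GT_mask=ds["GT"] == sentinel)
```

With this style, nothing stops a user from passing an unsuitable dataset, but the failure is early and the error message names the offending variable.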