Explore typed/schema-based Dataset options #43
For reference, scikit-allel defined a number of array classes with methods, e.g., the GenotypeArray class has methods like count_alleles(), and the AlleleCountsArray class has methods like is_segregating(). This is implemented via encapsulation, e.g., GenotypeArray and AlleleCountsArray are wrappers around a numpy array. We encountered some problems with this pattern when used with dask distributed, as well as some usability challenges: explaining the concept of a wrapper class to folks from an analysis background can be difficult. When exploring designs for the skallel v2 prototype, I went for a purely functional API, i.e., no classes. Instead there are functions that are effectively methods, in the sense that they expect a certain type of input, e.g., genotypes_count_alleles() and allele_counts_is_segregating(). That has its own trade-offs though, particularly around achieving clear function naming.
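To make the contrast concrete, here is a rough sketch of the two API styles described above. The class and function names mirror scikit-allel and the skallel v2 prototype, but the bodies are illustrative assumptions, not the real implementations.

```python
import numpy as np

# Style 1: encapsulation -- a thin wrapper class around a numpy array,
# in the spirit of scikit-allel's GenotypeArray.
class GenotypeArray:
    def __init__(self, values):
        self.values = np.asarray(values)  # shape: (variants, samples, ploidy)

    def count_alleles(self, max_allele=None):
        """Per-variant allele counts; missing calls (< 0) never match an allele."""
        if max_allele is None:
            max_allele = int(self.values.max())
        return np.stack(
            [(self.values == a).sum(axis=(1, 2)) for a in range(max_allele + 1)],
            axis=1,
        )

# Style 2: purely functional -- a plain function whose name encodes the
# expected input type, in the spirit of the skallel v2 prototype.
def genotypes_count_alleles(gt, max_allele=None):
    gt = np.asarray(gt)
    if max_allele is None:
        max_allele = int(gt.max())
    return np.stack(
        [(gt == a).sum(axis=(1, 2)) for a in range(max_allele + 1)],
        axis=1,
    )
```

Both compute the same thing; the trade-off is method discoverability on the wrapper class versus avoiding wrapper classes entirely at the cost of longer, type-prefixed function names.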
One question I have is how to handle the very simple use case where you want to subset a genotype call dataset to only segregating variants. In scikit-allel you would do something like:
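(The original snippet isn't preserved in this copy of the thread; the following is a sketch of the usual scikit-allel idiom, using the methods mentioned above.)

```python
import allel

# Toy genotype calls: 2 variants x 2 samples, diploid
gt = allel.GenotypeArray([[[0, 0], [0, 1]],
                          [[0, 0], [0, 0]]])
ac = gt.count_alleles()            # AlleleCountsArray, shape (variants, alleles)
flt = ac.is_segregating()          # boolean mask over variants
gt_seg = gt.compress(flt, axis=0)  # keep only the segregating variants
```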
What would we do in sgkit?
I believe @ravwojdyla is experimenting in the api_dataset branch for those that would like to follow along.
Here are a few notes on some topics/discussions I had related to applying schemas to a dataset:
BTW re: is_segregating (@alimanfoo) I'm not sure if this would be more efficient than basing an

```python
# This looks for any calls that are different from a nan-aware
# mean (which would be the same w/ no segregation)
is_segregating = (
    ds
    .assign(cgo=ds.call_genotype, cgf=ds.call_genotype >= 0)
    .assign(cgm=lambda ds: ds.cgo.weighted(ds.cgf).mean(dim=('samples', 'ploidy')))
    .pipe(lambda ds: ds.cgf & (ds.cgo != ds.cgm))
    .any(dim=('samples', 'ploidy'))
)
```

Gist with an example like scikit-allel's is_segregating: https://gist.github.com/eric-czech/41b3827cbae91a3694fc2bf2c6d87911

This uses the xarray.core.weighted.DataArrayWeighted subclass, which I haven't seen before. Pretty neat.
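A possible follow-up, not from the thread itself: assuming the variant axis is the "variants" dimension, the boolean mask above could be used to answer the earlier subsetting question.

```python
import numpy as np

# Turn the boolean variant mask into integer positions and keep only those
# variants ("variants" is an assumed dimension name).
ds_seg = ds.isel(variants=np.flatnonzero(is_segregating))
```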
Current POC branch has two parts:
1. Schema definition and validation (see the sketch after this list)
2. Generic container for xarray
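A minimal, hypothetical sketch of what dimension/dtype schema definition and validation for an xarray Dataset could look like. This is not the POC branch code; the schema layout, variable names, and checks are all assumptions.

```python
import numpy as np
import xarray as xr

# Hypothetical schema: variable name -> (expected dims, expected dtype kind)
CALL_GENOTYPE_SCHEMA = {
    "call_genotype": (("variants", "samples", "ploidy"), "i"),
    "variant_position": (("variants",), "i"),
}

def validate(ds: xr.Dataset, schema: dict) -> None:
    """Raise ValueError if ds does not conform to the given schema."""
    for name, (dims, dtype_kind) in schema.items():
        if name not in ds:
            raise ValueError(f"missing variable: {name}")
        var = ds[name]
        if var.dims != dims:
            raise ValueError(f"{name}: expected dims {dims}, got {var.dims}")
        if var.dtype.kind != dtype_kind:
            raise ValueError(f"{name}: expected dtype kind {dtype_kind!r}, got {var.dtype.kind!r}")

# Example usage on a toy dataset
ds = xr.Dataset({
    "call_genotype": (("variants", "samples", "ploidy"),
                      np.array([[[0, 1], [0, 0]]], dtype="int8")),
    "variant_position": (("variants",), np.array([100], dtype="int32")),
})
validate(ds, CALL_GENOTYPE_SCHEMA)
```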
Going forward I will submit a PR for only the 1st part, and we can decide about the 2nd part later. Does this sound ok to both of you @tomwhite @eric-czech? This obviously doesn't mean that we will merge the 1st PR, just that we can have a more concrete discussion about the implementation details etc.
Sounds good @ravwojdyla. That first part would include the schema definitions for each function result too, right (i.e. your changes to
@eric-czech thanks - yep, that's correct. Straight away, if you or @tomwhite have any comments on the current declaration, please let me know (example).
+1 I think schema definition and validation is the more valuable part here.
Renamed this issue to better capture the core of the problem/issue ^
An interesting discussion on the use of Pydantic with Xarray is happening at openclimatefix/nowcasting_dataset#211, in case we ever revisit this topic in the future! On the Xarray side the relevant issue is pydata/xarray#1900.
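For a rough idea of what "Pydantic with Xarray" could mean in this context, here is a purely illustrative sketch (not what either linked issue proposes): a declarative Pydantic model describing one variable, with a method that checks a DataArray against it.

```python
from typing import Tuple

import numpy as np
import xarray as xr
from pydantic import BaseModel


class VariableSchema(BaseModel):
    """Declarative description of a single Dataset variable."""
    dims: Tuple[str, ...]
    dtype: str

    def check(self, da: xr.DataArray) -> None:
        if da.dims != self.dims:
            raise ValueError(f"expected dims {self.dims}, got {da.dims}")
        if str(da.dtype) != self.dtype:
            raise ValueError(f"expected dtype {self.dtype}, got {da.dtype}")


call_genotype = VariableSchema(dims=("variants", "samples", "ploidy"), dtype="int8")
call_genotype.check(
    xr.DataArray(np.zeros((1, 2, 2), dtype="int8"), dims=("variants", "samples", "ploidy"))
)
```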
https://github.com/carbonplan/xarray-schema looks relevant |
Raising this issue to break out discussion of API design originating in #16.