We have "How do I define a new variable based on others?", and we have "Adding column fields" in the GWAS docs. I think we could expand this documentation to include, for example, the case where a sample annotation file doesn't contain a matching index. I always forget the deal with dims and coords and whatnot. It would also be useful to show how to add data variables as Dask arrays whose chunking matches that of the other data variables.
hammer changed the title from "How do I add a new sample or variant metadata column from a file?" to "How do I add a new sample or variant data variable from a file?" on Dec 2, 2023
Here's an example that took me way too long to figure out.
I wanted to add one column of sample metadata to my dataset. The file with the new column is very simple: it has just a three-letter sample ancestry code per line, with no header and no sample id, e.g.
AFR
AFR
AFR
AMR
AMR
AMR
EAS
EAS
...
In the GWAS tutorial, we have a full dataframe with sample id as index and several new columns, so we make a new dataset and use merge.
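For contrast, here is a minimal sketch of the index-based case, where the annotation table does carry sample ids. It is not the tutorial's code: it uses a pandas `reindex` in place of the merge, and the toy dataset, the sample ids, and the `ancestry` column name are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset with a sample_id variable on the "samples" dimension (hypothetical ids).
ds = xr.Dataset({"sample_id": ("samples", np.array(["S1", "S2", "S3", "S4"]))})

# Hypothetical annotation table: it has sample ids, but in a different order
# and with one sample (S4) missing.
ann = pd.DataFrame(
    {"sample_id": ["S3", "S1", "S2"], "ancestry": ["EAS", "AFR", "AMR"]}
).set_index("sample_id")

# Reorder the rows to follow the dataset's sample order.
# Samples absent from the table end up as missing values.
aligned = ann.reindex(ds["sample_id"].values)

# Attach the aligned column along the "samples" dimension.
ds = ds.assign(sample_ancestry=("samples", aligned["ancestry"].to_numpy()))
```

Because the rows are matched by id, the order of the annotation file doesn't matter here, which is exactly what a headerless, id-less file can't offer.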
In this case, we don't need a full merge. Here's what I ended up doing:
import dask.array as da
import pandas as pd

# file_base and ds (the existing dataset) are assumed to be defined earlier
ancestry_file = 'gs://hapnest/example/' + file_base + '.sample'
df = pd.read_csv(ancestry_file, header=None)
# Make a Dask array from df and add it to ds as a new data variable called "sample_ancestry" on the "samples" dimension
# Is there a better dtype to use than "object"?
ancestry_da = da.from_array(df[0].values, chunks=(600,))  # chunk size chosen to match the other data variables
ds = ds.assign(sample_ancestry=('samples', ancestry_da))
What made this tricky for me:
Making the data variable a Dask array with the same chunking as the other data variables. This is quite easy, but it's something new users might not know about.
Figuring out the syntax of assign that aligns the new data variable with the samples dimension.
Ultimately I think this was hard because the xarray docs for assign don't have enough examples. We can add more examples in our docs and maybe make some changes upstream.
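As one possible shape for such a docs example, here is a minimal, self-contained sketch of both points. The toy dataset, the `call_dosage` variable, and the chunk sizes are stand-ins, not taken from any real data. The fixed-width `"U3"` dtype is only one option for avoiding `object`, and it assumes every code is exactly three letters.

```python
import dask.array as da
import pandas as pd
import xarray as xr

# Toy stand-in for an existing dataset: 4 variants x 6 samples,
# chunked in blocks of 3 along "samples".
ds = xr.Dataset(
    {
        "call_dosage": (
            ("variants", "samples"),
            da.zeros((4, 6), chunks=(2, 3)),
        )
    }
)

# Stand-in for the headerless one-column file after pd.read_csv(..., header=None).
df = pd.DataFrame({0: ["AFR", "AFR", "AFR", "AMR", "AMR", "EAS"]})

# Positional assignment only makes sense if the file has one row per sample,
# in the same order as the dataset.
assert len(df) == ds.sizes["samples"]

# Look up the chunk sizes the existing variables use along "samples"
# instead of hard-coding a number.
sample_chunks = ds.chunks["samples"]

# Fixed-width unicode instead of "object" (assumes three-letter codes).
values = df[0].to_numpy().astype("U3")
ancestry = da.from_array(values, chunks=(sample_chunks,))

# The (dims, data) tuple tells assign which dimension the new variable lives on.
ds = ds.assign(sample_ancestry=("samples", ancestry))
```

The `("samples", ancestry)` tuple is the part that is easy to miss: without the dimension name, xarray has no way to know that the new one-dimensional array belongs on `samples`.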