How do I add a new sample or variant data variable from a file? #1151

hammer · 2023-12-02T22:40:13Z

We have "How do I define a new variable based on others?", and we have "Adding column fields" in the GWAS docs. I think we could expand this documentation to include the case when a sample annotation file doesn't contain a matching index, for example. I always forget the deal with dims and coords and whatnot, and ensuring you add data variables as Dask arrays that match the chunking of the other data variables is useful too.

hammer · 2023-12-03T02:13:38Z

Here's an example that took me way too long to figure out.

I wanted to add one column of sample metadata to my dataset. The file with the new column is very simple: each line holds a three-letter sample ancestry code, with no header and no sample ID, e.g.

AFR
AFR
AFR
AMR
AMR
AMR
EAS
EAS
...

In the GWAS tutorial, we have a full dataframe with sample id as index and several new columns, so we make a new dataset and use merge.
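For comparison, a minimal sketch of that merge approach might look like the following. It is not the tutorial's exact code: the toy dataset, the sample ids, and the intermediate names (`ds_annotations`, `ds_indexed`, `merged`) are made up for illustration, and it assumes the annotation table is indexed by sample id.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset standing in for a real sgkit dataset (hypothetical values).
ds = xr.Dataset({"sample_id": ("samples", np.array(["S1", "S2", "S3"]))})

# Annotation table indexed by sample id, in a different order and missing S2.
df = pd.DataFrame(
    {"sample_ancestry": ["EAS", "AFR"]},
    index=pd.Index(["S3", "S1"], name="samples"),
)
ds_annotations = df.to_xarray()

# Label the samples dimension with the ids so merge can align on them.
ds_indexed = ds.assign_coords(samples=ds["sample_id"].values)

# A left join keeps every sample in ds; samples without an annotation get NaN.
merged = ds_indexed.merge(ds_annotations, join="left")

# Drop the temporary labels to return to an unlabeled samples dimension.
merged = merged.drop_vars("samples")
```

The merge only makes sense when the file carries sample ids to align on, which is exactly what the ancestry file above lacks.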

In this case, we don't need a full merge. Here's what I ended up doing:

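import dask.array as da
import pandas as pd

# file_base and ds are assumed to be defined already
ancestry_file = 'gs://hapnest/example/' + file_base + '.sample'
df = pd.read_csv(ancestry_file, header=None)

# Make a Dask array from df and add it to ds as a new data variable called
# "sample_ancestry" on the "samples" dimension.
# The chunk size should match the chunking of the other data variables
# along "samples" (600 here).
# Is there a better dtype to use than "object"?
ancestry_da = da.from_array(df[0].values, chunks=(600,))

# The (dimension name, data) tuple attaches the array to the "samples" dimension
ds = ds.assign(sample_ancestry=('samples', ancestry_da))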

What made this tricky for me:

Making the data variable a Dask array with the same chunking as the other data variables. This is quite easy, but it's something new users might not know about.
Figuring out the syntax of assign that would align the new data variable with the samples dimension.
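Both points can be sketched on a small synthetic dataset. This is an illustration under stated assumptions, not a confirmed recipe: the toy `ds`, the eight made-up ancestry rows, and the chunk size of 4 are hypothetical, and the cast to a fixed-width `"U3"` string dtype is just one possible answer to the dtype question above, not something the thread settles. The sketch reads the chunking from an existing variable instead of hard-coding it, and relies on the file rows being in the same order as the samples in the dataset.

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset with one Dask-backed variable on "samples" (hypothetical data).
n_samples = 8
sample_ids = np.array([f"S{i}" for i in range(n_samples)])
ds = xr.Dataset(
    {"sample_id": ("samples", da.from_array(sample_ids, chunks=(4,)))}
)

# Stand-in for the headerless, one-column ancestry file.
df = pd.DataFrame({0: ["AFR"] * 3 + ["AMR"] * 3 + ["EAS"] * 2})

# With no sample ids in the file, row order is the only link to the dataset,
# so at least check that the lengths agree.
assert len(df) == ds.sizes["samples"]

# Reuse the chunking of an existing variable along "samples".
sample_chunks = ds["sample_id"].chunks  # tuple of per-dimension block sizes

# Assumption: a fixed-width unicode dtype instead of "object".
values = df[0].to_numpy().astype("U3")
ancestry_da = da.from_array(values, chunks=sample_chunks)

# The (dims, data) tuple tells assign which dimension the array lies along.
ds = ds.assign(sample_ancestry=("samples", ancestry_da))
```

Passing a bare array to `assign` without the dimension name gives xarray nothing to align on, which is why the tuple form is the part worth remembering.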

Ultimately I think this was hard because the xarray docs for assign don't have enough examples. We can add more examples in our docs and maybe make some changes upstream.

hammer changed the title ~~How do I add a new sample or variant metadata column from a file?~~ How do I add a new sample or variant data variable from a file? Dec 2, 2023

hammer added the documentation Improvements or additions to documentation label Dec 2, 2023

hammer mentioned this issue Jan 6, 2024

How do I... additions #1165

Open
