How do I add a new sample or variant data variable from a file? #1151

Open
hammer opened this issue Dec 2, 2023 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

hammer commented Dec 2, 2023

We have "How do I define a new variable based on others?", and we have "Adding column fields" in the GWAS docs. I think we could expand this documentation to cover, for example, the case where a sample annotation file doesn't contain a matching index. I always forget the deal with dims and coords and whatnot, and it would also be useful to show how to add data variables as Dask arrays that match the chunking of the other data variables.

hammer changed the title from "How do I add a new sample or variant metadata column from a file?" to "How do I add a new sample or variant data variable from a file?" on Dec 2, 2023
hammer added the documentation label on Dec 2, 2023
hammer commented Dec 3, 2023

Here's an example that took me way too long to figure out.

I wanted to add one column of sample metadata to my dataset. The file with the new column is very simple: each line holds one three-letter sample ancestry code, with no header and no sample ID, e.g.

AFR
AFR
AFR
AMR
AMR
AMR
EAS
EAS
...

In the GWAS tutorial, we have a full dataframe with sample id as index and several new columns, so we make a new dataset and use merge.
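
For comparison, here is a minimal sketch of that merge-style approach on a toy dataset. It assumes the dataset carries a `samples` index coordinate holding the sample IDs, which a real sgkit dataset may not have out of the box (the IDs usually live in the `sample_id` variable). All names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset with a "samples" index coordinate (an assumption of this sketch).
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.zeros((2, 3)))},
    coords={"samples": ["S0", "S1", "S2"]},
)

# Annotation table indexed by sample ID, deliberately in a different order.
df = pd.DataFrame(
    {"sample_ancestry": ["EAS", "AFR", "AMR"]},
    index=pd.Index(["S2", "S0", "S1"], name="samples"),
)

# The index name becomes the dimension name, so xarray can align on it.
ds_annotations = df.to_xarray()

# join="left" keeps the sample order of the original dataset.
merged = ds.merge(ds_annotations, join="left")
print(merged["sample_ancestry"].values)
```

Because the merge aligns on sample ID, the row order of the annotation file does not matter here, which is exactly what is lost when the file has no ID column.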

In this case, we don't need a full merge. Here's what I ended up doing:

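import dask.array as da
import pandas as pd

# file_base is assumed to be defined earlier; reading gs:// paths with pandas requires gcsfs
ancestry_file = 'gs://hapnest/example/' + file_base + '.sample'
df = pd.read_csv(ancestry_file, header=None)

# Make a Dask array from df and add it to ds as a new data variable called "sample_ancestry" on the "samples" dimension
# Is there a better dtype to use than "object"?
# The chunk size should match the chunking of the other data variables along "samples" (600 here)
ancestry_da = da.from_array(df[0].values, chunks=(600,))

# The (dimension name, data) tuple tells xarray which dimension the new variable lies along
ds = ds.assign(sample_ancestry=('samples', ancestry_da))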
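
On the dtype question, one possible answer is a fixed-width NumPy unicode dtype instead of "object". The sketch below uses a small made-up column in place of the real file and is only one option, not necessarily what sgkit itself recommends.

```python
import dask.array as da
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ancestry column read with header=None.
df = pd.DataFrame({0: ["AFR", "AFR", "AMR", "EAS"]})

# astype(str) converts the strings to a fixed-width unicode dtype sized to
# the longest value, so three-letter codes end up as "<U3".
values = df[0].to_numpy().astype(str)

# Chunk size of 2 is arbitrary here; use your dataset's chunking in practice.
ancestry_da = da.from_array(values, chunks=(2,))
print(ancestry_da.dtype, ancestry_da.chunks)
```

A fixed-width dtype avoids storing Python objects in the array, which tends to serialize more cleanly than "object" arrays.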

What made this tricky for me:

  1. Making the data variable a Dask array with the same chunking as the other data variables. Quite easy, but something new users might not know about.
  2. Figuring out the assign syntax that aligns the new data variable with the samples dimension.
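
Both points can be sketched on a toy dataset. Rather than hard-coding the chunk size, this version reads it from the dataset, then uses the `(dimension name, data)` tuple form of `assign`. The variable names, shapes, and chunk sizes are invented for illustration.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Toy dataset standing in for a real one: 4 variants x 6 samples,
# chunked in blocks of 3 along "samples".
ds = xr.Dataset(
    {
        "call_dosage": (
            ("variants", "samples"),
            da.zeros((4, 6), chunks=(2, 3)),
        )
    }
)

# 1. Look up the existing chunking along "samples" instead of hard-coding it.
sample_chunks = ds.chunks["samples"]

ancestry = np.array(["AFR", "AFR", "AMR", "AMR", "EAS", "EAS"])
ancestry_da = da.from_array(ancestry, chunks=(sample_chunks,))

# 2. A (dims, data) tuple attaches the array to the "samples" dimension.
#    The array length must equal the size of that dimension.
ds = ds.assign(sample_ancestry=("samples", ancestry_da))
print(ds["sample_ancestry"])
```

The same tuple form also works with item assignment, i.e. `ds["sample_ancestry"] = ("samples", ancestry_da)`, if modifying the dataset in place is acceptable.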

Ultimately I think this was hard because the xarray docs for assign don't have enough examples. We can add more examples in our docs and maybe make some changes upstream.
