Windowing fails for datasets where contigs lack variant sites #1268
Comments
Thanks for the bug report @percyfal
The obvious workaround here is to exclude contigs without variants. I tried to do so on an existing Zarr datastore, but it wasn't evident how to update a coordinate and write it back to the store, since that would require restructuring chunks. I ended up loading the VCF files with an updated VCF header that excludes contigs without variants.
Should that be an option in vcf2zarr @percyfal?
Yes, the thought has occurred to me. This would presumably be done at the
It's a common thing to have loads of contigs in the header with no variants. We could keep track of the number of variants for each contig during explode, and then have the option of filtering out unreferenced contigs during encode. |
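As a rough illustration of the filtering idea above, here is a minimal sketch in plain Python. It is not vcf2zarr code: the function name `drop_unreferenced_contigs` and the list-based inputs are hypothetical, standing in for the contig ID list and the per-variant contig index. It counts variants per contig, keeps only contigs with at least one variant, and remaps the per-variant indices to the reduced list.

```python
from collections import Counter


def drop_unreferenced_contigs(contig_ids, variant_contig):
    """Hypothetical helper: drop contigs that no variant refers to.

    contig_ids: list of contig names, as listed in the VCF header.
    variant_contig: for each variant, the index of its contig in contig_ids.
    Returns the filtered contig names and the remapped per-variant indices.
    """
    # Count how many variants refer to each contig index.
    counts = Counter(variant_contig)
    # Keep the original ordering of the contigs that have variants.
    kept = [i for i in range(len(contig_ids)) if counts[i] > 0]
    # Map old contig indices to their position in the filtered list.
    remap = {old: new for new, old in enumerate(kept)}
    kept_ids = [contig_ids[i] for i in kept]
    new_variant_contig = [remap[c] for c in variant_contig]
    return kept_ids, new_variant_contig


# Made-up example: "empty_1" is in the header but has no variants.
contig_ids = ["19", "empty_1", "20", "X"]
variant_contig = [0, 0, 2, 2, 3]
kept_ids, new_variant_contig = drop_unreferenced_contigs(contig_ids, variant_contig)
print(kept_ids)
print(new_variant_contig)
```

Remapping the indices is the step that makes this awkward to do in place on an existing store, since both the contig list and the per-variant index array change.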
The function sgkit.window_by_position fails if the input dataset has contigs that lack variant sites, throwing an IndexError.
Traceback
This will often be the case for fragmented references with many contigs. I provide a minimum working example for reference.
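To make the failure mode concrete, here is a simplified sketch of position-based windowing in plain Python. It is not sgkit's implementation, and the function name `window_starts_by_position` is hypothetical. It only shows one way an empty contig can surface as an IndexError (taking the first position of a contig that has none) and how a guard avoids it.

```python
def window_starts_by_position(variant_contig, variant_position, n_contigs, size):
    """Hypothetical sketch: fixed-size windows anchored at each contig's first variant.

    Returns two lists: the contig index of each window and the index of the
    first variant in that window. Variants are assumed to be sorted by
    contig and then by position.
    """
    # Group variant indices by contig; contigs without variants stay empty.
    per_contig = [[] for _ in range(n_contigs)]
    for idx, contig in enumerate(variant_contig):
        per_contig[contig].append(idx)

    window_contig, window_start = [], []
    for contig, idxs in enumerate(per_contig):
        if not idxs:
            # Without this guard, idxs[0] below raises IndexError.
            continue
        first_pos = variant_position[idxs[0]]
        last_window = None
        for i in idxs:
            w = (variant_position[i] - first_pos) // size
            if w != last_window:
                # First variant that falls into a new window.
                window_contig.append(contig)
                window_start.append(i)
                last_window = w
    return window_contig, window_start


# Made-up data: contig 1 has no variants.
variant_contig = [0, 0, 0, 2, 2]
variant_position = [10, 15, 120, 5, 7]
window_contig, window_start = window_starts_by_position(
    variant_contig, variant_position, n_contigs=3, size=100
)
print(window_contig, window_start)
```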
Minimum working example
The VCF below is a modified version of the file sgkit/tests/data/sample.vcf, where I added the contig key to the header:
Convert to Zarr and run the code below to reproduce: