scglue.data.get_gene_annotation gives many NaN values. #122

Open
oldvalley49 opened this issue Jul 18, 2024 · 5 comments

@oldvalley49

Hello!

Thank you for developing this tool.
I wanted to reach out to see if something is wrong with my code. I'm using a dataset preprocessed with Seurat for GLUE and following the tutorial for scRNA and scATAC integration. When I run this code segment:

scglue.data.get_gene_annotation(
    pbmc_rna, gtf="gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz",
    gtf_by="gene_name"
)

Many of the genes end up with NaN in chrom, chromStart, and chromEnd. The genes that fail to get ranges assigned seem to be those whose names start with AL, such as AL627309.1 and AL590822.1. How can I fix this issue? Thank you in advance.

@TuDou-PK

Same problem for me. Did you solve it?

@zlqq1001

I have the same problem, and it occurred in all 3 of my datasets. First I tried removing the NA gene rows from rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]], and step 1 (data preprocessing) then ran without errors. But a new problem appeared at the beginning of step 2 (model training): I couldn't load the rna_pp.h5ad file produced by step 1, because of the earlier deletion of these NA rows:
rna = ad.read_h5ad("rna_pp.h5ad")
ValueError: Variables annot. var must have number of columns of X (11787), but has 10215 rows.

Then I tried filling the NA values in "chrom" with "chrn" and those in "chromStart" and "chromEnd" with random numbers, but it still failed when I ran:
guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
and the error was: ValueError: Not all features are strand specific!

So how should this problem be handled? Around 10% of genes fail to get ranges assigned; what should be done with these NA rows? Thanks in advance.

@Jeff1995
Collaborator

Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:

  1. Replace gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;
  2. Remove those genes from adata.var along with the corresponding columns in adata.X, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].

Let me know if these solutions work.
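
To check the diagnosis above before choosing between the two options, one can list which gene names are absent from the GTF. The following is a minimal standard-library sketch, not part of scglue: the helper names gtf_gene_names and missing_genes are hypothetical, and it assumes that gene names are matched against the gene_name attribute of gene-level records, mirroring gtf_by="gene_name".

```python
import gzip
import re

# Matches the gene_name attribute in the 9th GTF column, e.g. gene_name "Xkr4";
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_gene_names(gtf_path):
    """Collect the gene_name of every 'gene' record in a (optionally gzipped) GTF."""
    opener = gzip.open if str(gtf_path).endswith(".gz") else open
    names = set()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            # Assumption: only gene-level records matter for the lookup.
            if len(fields) < 9 or fields[2] != "gene":
                continue
            match = GENE_NAME_RE.search(fields[8])
            if match:
                names.add(match.group(1))
    return names


def missing_genes(var_names, gtf_path):
    """Return the names (in input order) that have no 'gene' record in the GTF."""
    annotated = gtf_gene_names(gtf_path)
    return [name for name in var_names if name not in annotated]
```

For example, missing_genes(rna.var_names, "gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz") would return the genes expected to end up with NaN coordinates. If that list is long, the GTF probably does not match the reference used for quantification.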

@zlqq1001

zlqq1001 commented Aug 7, 2024

Thanks for the reply. I tried option 2 and it worked well!
Here's my code:
I used 'gencode.v46.chr_patch_hapl_scaff.annotation.gtf.gz' and had 1572 genes that couldn't be matched.
# dims
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0]  # 11787 rows in total

# find NA
na_rows = rna.var.loc[:, ["chromStart", "chromEnd"]].isna().any(axis=1)
rna.var.loc[na_rows, ["chrom", "chromStart", "chromEnd"]]  # 1572 NA rows

# delete NA genes from the AnnData object, so that var and X are subset together
# (dropping rows from rna.var alone with inplace=True leaves var out of sync with X)
rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].copy()
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0]  # 10215 rows

# change the type to int
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].dtypes
rna.var["chromStart"] = rna.var["chromStart"].astype(int)
rna.var["chromEnd"] = rna.var["chromEnd"].astype(int)
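
Subsetting the AnnData object works where dropping rows from rna.var alone failed because the gene annotation rows and the matrix columns have to shrink together. A toy illustration of that idea, using plain NumPy and pandas stand-ins rather than AnnData itself (the gene names and coordinates below are made up for the example):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for adata.X (cells x genes) and adata.var (one row per gene).
X = np.arange(12).reshape(3, 4)
var = pd.DataFrame(
    {
        "chrom": ["chr1", None, "chr2", None],
        "chromStart": [100.0, np.nan, 300.0, np.nan],
        "chromEnd": [200.0, np.nan, 400.0, np.nan],
    },
    index=["GeneA", "AL627309.1", "GeneB", "AL590822.1"],
)

# One boolean mask decides which genes survive...
keep = var[["chrom", "chromStart", "chromEnd"]].notna().all(axis=1).to_numpy()

# ...and is applied to both the annotation rows and the matrix columns.
var_kept = var.loc[keep].copy()
X_kept = X[:, keep]

# With no NaN left, the coordinates can be cast to integers.
var_kept["chromStart"] = var_kept["chromStart"].astype(int)
var_kept["chromEnd"] = var_kept["chromEnd"].astype(int)
```

Here var_kept has as many rows as X_kept has columns. Filtering only var would leave 2 annotation rows against 4 matrix columns, which is the same kind of mismatch as the "Variables annot. var must have number of columns of X" error reported earlier in the thread.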

@oldvalley49
Author

Thanks for getting back; option 1 works well for me. For anyone encountering similar issues in the future: if you are working with data from 10x Genomics, you can download their reference files from their website.
