scglue.data.get_gene_annotation gives many NaN values. #122

oldvalley49 · 2024-07-18T13:51:12Z

Hello!

Thank you for developing this tool.
I wanted to reach out to see if something was going wrong with my code. I'm using dataset preprocessed with Seurat for GLUE and following the tutorial for scRNA and scATAC integration. When I run this code segment:

scglue.data.get_gene_annotation(
pbmc_rna, gtf="gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz",
gtf_by="gene_name"
)

many of the genes end up getting NaN in chrom, chromStart, and chromEnd. The genes that fail to assign the ranges seem to be those that start with AL such as AL627309.1, AL590822.1. How can I fix this issue? Thank you in advance.

TuDou-PK · 2024-07-24T16:33:39Z

same problem for me, do you solve it?

zlqq1001 · 2024-07-31T10:27:13Z

The same problem and it came always in my 3 datasets. First I tried to remove these NA gene's rows of rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]], and everything went well in the step1 of pre-treating data with 0 bugs. But it came a new problem in the beginning of step2, the model training, that I can't load RNA-pp.h5ad files came from step1, just because of the before deletion of these NA rows:
rna = ad.read_h5ad("rna_pp.h5ad")
ValueError: Variables annot. var must have number of columns of X (11787), but has 10215 rows.

Then I tried to fill these "chrom" NA by "chrn" and "chromStart&end" NA by some random number, but it still couldn't work when I ran:
guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
and the error was: ValueError: Not all features are strand specific!

So how should this problem be handled? Around 10% genes fail to assign the ranges and how to do with these genes NA rows? Thanks for advance.

Jeff1995 · 2024-07-31T15:24:20Z

Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:

Replace gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;
Remove those genes from adata.var along with the corresponding columns in adata.X, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].

Let me know if these solutions work.

zlqq1001 · 2024-08-07T07:13:23Z

Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:

Replace gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;

Remove those genes from adata.var along with the corresponding columns in adata.X, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].

Let me know if these solutions work.

Thanks for reply. I tried 2 and it did work well!
Here's my codes:
I used 'gencode.v46.chr_patch_hapl_scaff.annotation.gtf.gz' and had 1572 genes which couldn't be matched.
#dims
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 11787 rows

#find NA
na_rows = rna.var.loc[:, ["chromStart", "chromEnd"]].isna().any(axis=1)
rna.var[na_rows].loc[:, ["chrom", "chromStart", "chromEnd"]] #1572 NA rows

delete NA rows

rna.var.dropna(subset=["chromStart", "chromEnd"], inplace=True)
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 10215 rows

#delete NA rows in anondata
rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index]

#change the type to int
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].dtypes
rna.var["chromStart"] = rna.var["chromStart"].astype(int)
rna.var["chromEnd"] = rna.var["chromEnd"].astype(int)

oldvalley49 · 2024-08-12T03:09:06Z

thanks for getting back; 1 works well for me. for anyone encountering similar issues in the future: if you are working with data from 10x genomics you can download their reference file from their website.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scglue.data.get_gene_annotation gives many NaN values. #122

scglue.data.get_gene_annotation gives many NaN values. #122

oldvalley49 commented Jul 18, 2024

TuDou-PK commented Jul 24, 2024

zlqq1001 commented Jul 31, 2024

Jeff1995 commented Jul 31, 2024

zlqq1001 commented Aug 7, 2024

oldvalley49 commented Aug 12, 2024

scglue.data.get_gene_annotation gives many NaN values. #122

scglue.data.get_gene_annotation gives many NaN values. #122

Comments

oldvalley49 commented Jul 18, 2024

TuDou-PK commented Jul 24, 2024

zlqq1001 commented Jul 31, 2024

Jeff1995 commented Jul 31, 2024

zlqq1001 commented Aug 7, 2024

delete NA rows

oldvalley49 commented Aug 12, 2024