Custom refgen generation (scRNA) #2234

Open
radiasso opened this issue Nov 5, 2024 · 0 comments

radiasso commented Nov 5, 2024

Hi, I need to generate a reference genome from the mouse refgen mm10 with 4 additional genes (4 fluorescent proteins).
I already had a working refgen that included them, but it was built with a previous version of STAR, so now I have to regenerate it (STAR 2.7.11b).

Following the instructions, I added the 4 genes at the end of the GTF file, tab-separated:

hrGFPIINLS      unknown exon    1       798     .       +       .       gene_id "hrGFPIINLS"; transcript_id "hrGFPIINLS"; gene_name "hrGFPIINLS"; gene_biotype protein_coding
EYFP    unknown exon    1       720     .       +       .       gene_id "EYFP"; transcript_id "EYFP"; gene_name "EYFP"; gene_biotype protein_coding
tdimer2 unknown exon    1       1395    .       +       .       gene_id "tdimer2"; transcript_id "tdimer2"; gene_name "tdimer2"; gene_biotype protein_coding
MbmCerulean     unknown exon    1       825     .       +       .       gene_id "MbmCerulean"; transcript_id "MbmCerulean"; gene_name "MbmCerulean"; gene_biotype protein_coding
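Records like these can be sanity-checked against the usual GTF conventions: nine tab-separated fields, and an attribute column made of `key "value";` pairs. The sketch below is only an illustration. The function name `check_gtf_line` is hypothetical, splitting on `;` assumes no attribute value contains a semicolon, and whether STAR is sensitive to any of these details is not established here.

```python
import re

# One well-formed GTF attribute: a key, a space, and a double-quoted value.
ATTRIBUTE = re.compile(r'^\S+ "[^"]*"$')


def check_gtf_line(line):
    """Return a list of formatting problems found in one GTF record."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        problems.append(f"expected 9 tab-separated fields, found {len(fields)}")
        return problems
    attributes = fields[8].strip()
    if not attributes.endswith(";"):
        problems.append("attribute column does not end with ';'")
    # Simplification: assumes no quoted value contains a semicolon.
    for attribute in attributes.split(";"):
        attribute = attribute.strip()
        if attribute and not ATTRIBUTE.match(attribute):
            problems.append(f"malformed attribute: {attribute!r}")
    return problems
```

Running it over the appended lines of the GTF file (for example `genes_FLUO.gtf`) reports any record whose attributes are unquoted or unterminated, and any record that is separated by spaces instead of tabs.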

and added the corresponding sequences at the end of the FASTA file (.fa), like this:

>hrGFPIINLS dna:plasmid
ATGGTGAGCAAGCAGATCCTGAAGAACACCGGCCTGCAGGAGATCATGAGCTTCAAGGTG...
>EYFP dna:plasmid
TTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAACTCCAGCAG...
>tdimer2 dna:plasmid tdimer2
ATGGTGGCCTCCTCCGAGGACGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAG...
>MbmCerulean dna:plasmid
TTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAACTCCAGCAG...
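Since each added GTF record has to refer to a sequence that exists in the FASTA file, it can help to cross-check the two. The following is a minimal sketch, assuming the sequence name is the first word of each `>` header line and that the GTF records are tab-separated; the helper names `fasta_lengths` and `check_records` are hypothetical.

```python
def fasta_lengths(lines):
    """Map each sequence name (first word of the '>' header) to its length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_records(gtf_lines, lengths):
    """Report GTF records whose sequence is missing or too short in the FASTA."""
    problems = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        seqname, end = fields[0], int(fields[4])
        if seqname not in lengths:
            problems.append(f"{seqname}: not found in FASTA")
        elif end > lengths[seqname]:
            problems.append(
                f"{seqname}: feature end {end} exceeds sequence length {lengths[seqname]}"
            )
    return problems
```

Both functions accept any iterable of lines, so open file handles for the `.fa` and `.gtf` files can be passed directly. An empty result means every record points at a named sequence that is at least as long as the annotated exon.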

and generated the refgen with:

STAR --runThreadN 27 \
     --runMode genomeGenerate \
     --genomeDir ./STAR_index \
     --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly_FLUO.fa \
     --sjdbGTFfile genes_FLUO.gtf \
     --sjdbOverhang 100

The refgen is generated without errors, and so is the alignment, which I run (as always) with:
STAR --genomeDir=./STAR_index --readFilesIn=R2_001.fastq.gz, R2_001.fastq.gz --runThreadN=12 --soloType Droplet --soloCBwhitelist mylist.txt --soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts --soloUMIlen 12 --sjdbGTFfile=genes_FLUO.gtf --readFilesCommand zcat

The problem arises when I load my features, barcodes and matrix to create an AnnData object:
ValueError: Length of values (31057) does not match length of index (31053)
As if those 4 genes were not actually indexed, maybe?
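The two lengths in the error differ by exactly 4, so one way to narrow this down is to compare the number of entries in the features file with the row count declared in the matrix file. The sketch below assumes the matrix is in Matrix Market format with features as rows, and that the features file has one feature per line and no header. The function names and any file paths passed to `compare` are hypothetical.

```python
def matrix_market_shape(lines):
    """Return (rows, columns, entries) from the size line of a Matrix Market file."""
    for line in lines:
        # Header and comment lines start with '%'; the first other line is the size line.
        if line.startswith("%") or not line.strip():
            continue
        rows, cols, entries = (int(x) for x in line.split()[:3])
        return rows, cols, entries
    raise ValueError("no size line found")


def count_records(lines):
    """Count non-empty lines, e.g. one feature or one barcode per line."""
    return sum(1 for line in lines if line.strip())


def compare(features_path, matrix_path):
    """Return (features in the TSV, rows declared in the matrix) for two given paths."""
    with open(features_path) as handle:
        n_features = count_records(handle)
    with open(matrix_path) as handle:
        n_rows, _, _ = matrix_market_shape(handle)
    return n_features, n_rows
```

If the two numbers differ, the features file and the matrix were probably not produced by the same run or from the same annotation. If they agree, the mismatch more likely comes from the loading code itself.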

What am I doing wrong?
Thank you so much for your help!
