Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Much fewer than expected exons. #2221

Open
EvanKomp opened this issue Sep 30, 2024 · 0 comments
Open

Much fewer than expected exons. #2221

EvanKomp opened this issue Sep 30, 2024 · 0 comments

Comments

@EvanKomp
Copy link

EvanKomp commented Sep 30, 2024

Hello! Thanks for any assistance in advance. I am trying to generate a genome for C. necator. I have tried both genomes here. The resulting GTF files have the expected ~6,000 genes. When I run the genomeGenerate protocol I get 80 exons.

Here is the run command:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles GCA_004798725.1_ASM479872v1_genomic.fna --sjdbGTFfile genomic.gtf --sjdbOverhang 150 --genomeSAindexNbases 5 --sjdbGTFfeatureExon exon

And the head of the input files:

>chr1 Cupriavidus necator H16 chromosome 1, complete sequence
TCTGTCTTGCTGCCGGCCGTGTTTATGCATGTGCAGGTCGCTAACCTGTTAGCAAGGACGGCACGAACAGGCCAGACCCG
GCGATGCGCATACGTGATCCCCTGCCGCATCGCAGAACCCGGCATTTAACCGATTTTCTGCGGCAATGTCAAGGCGCAAG
ACTATGTGCACCGGCACAATTGCCACACCCGTCTGTGGCCCGTGGTGCGCCATTGTCCACAAAGGCCAGTGGGTAACTGC
AAGTGTCCCGAGGGCCACCTCTACCGCATCCATCCACAGGCGAAGTCCCTGTTTTGCCTGAGATTTCCCCCGTTTCCACA
CCCCGGAAAGCCGCATCGCACCAAGGGTTGTGCACACAAATCCACTGCATTTGCACAAGTTATCCACAGACTTATCCTTT
GTGGATAACTCAGGGGCGGGGGTACAATCCGGGTCCAAACACAACACGGGCTGCGTGGCGCAGGCGCAACCCGGGCCCCC
ACCGCGGCAACGCATGACTGAACGCACGCCAGTGCGGCGAGCCTGAAGGCTCGCTTGCTGCTGCATTGGCATGCCGCGGA
GGCCTCACCGAACGGAATTGATTCCTAATACGGTAAGGATGGCCCGGATCGCGCGGCGTAGATGCTGCTTGTGGCGTGCC
GGGGGAACCGCTGATTTGCGGCCGGGTACGATTACCAAAAAACAAACAATGCAAGATTTCTGGCAGGCGGCAGCAGCGCA
#gtf-version 2.2
#!genome-build ASM479872v1
#!genome-build-accession NCBI_Assembly:GCA_004798725.1
#!annotation-date 04/11/2019 14:49:50
#!annotation-source NCBI 
chr1    Genbank gene    689     2455    .       +       .       gene_id "E6A55_00005"; transcript_id ""; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "E6A55_00005"; 
chr1    Protein Homology        CDS     689     2452    .       +       0       gene_id "E6A55_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011614275.1"; locus_tag "E6A55_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QCB99128.1"; transl_table "11"; exon_number "1"; 
chr1    Protein Homology        start_codon     689     691     .       +       0       gene_id "E6A55_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011614275.1"; locus_tag "E6A55_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QCB99128.1"; transl_table "11"; exon_number "1"; 
chr1    Protein Homology        stop_codon      2453    2455    .       +       0       gene_id "E6A55_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_011614275.1"; locus_tag "E6A55_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QCB99128.1"; transl_table "11"; exon_number "1"; 
chr1    Genbank gene    2823    3938    .       +       .       gene_id "E6A55_00010"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "E6A55_00010"; 

Here is the log:|

STAR version=2.7.5a
STAR compilation time,server,dir=Tue Jun 16 16:47:12 GMT 2020 :/Users/travis/build/alexdobin/travis-tests/STARcompile/source
##### Command Line:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir /Users/ekomp/Documents/repos/RNA-seq/data/genome --genomeFastaFiles GCA_004798725.1_ASM479872v1_genomic.fna --sjdbGTFfile genomic.gtf --sjdbOverhang 150 --genomeSAindexNbases 5 --sjdbGTFfeatureExon exon
##### Initial USER parameters from Command Line:
###### All USER parameters from Command Line:
runThreadN                    4     ~RE-DEFINED
runMode                       genomeGenerate     ~RE-DEFINED
genomeDir                     /Users/ekomp/Documents/repos/RNA-seq/data/genome     ~RE-DEFINED
genomeFastaFiles              GCA_004798725.1_ASM479872v1_genomic.fna        ~RE-DEFINED
sjdbGTFfile                   genomic.gtf     ~RE-DEFINED
sjdbOverhang                  150     ~RE-DEFINED
genomeSAindexNbases           5     ~RE-DEFINED
sjdbGTFfeatureExon            exon     ~RE-DEFINED
##### Finished reading parameters from all sources

##### Final user re-defined parameters-----------------:
runMode                           genomeGenerate
runThreadN                        4
genomeDir                         /Users/ekomp/Documents/repos/RNA-seq/data/genome
genomeFastaFiles                  GCA_004798725.1_ASM479872v1_genomic.fna   
genomeSAindexNbases               5
sjdbGTFfile                       genomic.gtf
sjdbGTFfeatureExon                exon
sjdbOverhang                      150

-------------------------------
##### Final effective command line:
STAR   --runMode genomeGenerate   --runThreadN 4   --genomeDir /Users/ekomp/Documents/repos/RNA-seq/data/genome   --genomeFastaFiles GCA_004798725.1_ASM479872v1_genomic.fna      --genomeSAindexNbases 5   --sjdbGTFfile genomic.gtf   --sjdbGTFfeatureExon exon   --sjdbOverhang 150
----------------------------------------

Number of fastq files for each mate = 1
Finished loading and checking parameters
--genomeDir directory exists and will be overwritten: /Users/ekomp/Documents/repos/RNA-seq/data/genome/
Sep 30 14:03:07 ... starting to generate Genome files
GCA_004798725.1_ASM479872v1_genomic.fna : chr # 0  "chr1" chrStart: 0
GCA_004798725.1_ASM479872v1_genomic.fna : chr # 1  "chr2" chrStart: 4194304
GCA_004798725.1_ASM479872v1_genomic.fna : chr # 2  "chr3" chrStart: 7340032
Sep 30 14:03:07 ..... processing annotations GTF
Processing pGe.sjdbGTFfile=genomic.gtf, found:
		83 transcripts
		83 exons (non-collapsed)
		0 collapsed junctions
Total junctions: 0
Sep 30 14:03:07 ..... finished GTF processing

Estimated genome size=308864320 = 7864320 + 301000000
GstrandBit=32
Number of SA indices: 14829122
Sep 30 14:03:07 ... starting to sort Suffix Array. This may take a long time...
Number of chunks: 4;   chunks size limit: 39544320 bytes
Sep 30 14:03:07 ... sorting Suffix Array chunks and saving them to disk...
Writing 1639784 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SA_3 ; empty space on disk = 371697975820288 bytes ... done
Writing 39011608 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SA_0 ; empty space on disk = 371697555341312 bytes ... done
Writing 38527992 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SA_1 ; empty space on disk = 371687567654912 bytes ... done
Writing 39453592 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SA_2 ; empty space on disk = 371677703700480 bytes ... done
Sep 30 14:03:08 ... loading chunks from disk, packing SA...
Sep 30 14:03:08 ... finished generating suffix array
Sep 30 14:03:08 ... generating Suffix Array index
Sep 30 14:03:08 ... completed Suffix Array index
Sep 30 14:03:08   Finished preparing junctions
Sep 30 14:03:08 ..... finished inserting junctions into genome
Sep 30 14:03:08 ... writing Genome to disk ...
Writing 7864320 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//Genome ; empty space on disk = 371699988037632 bytes ... done
SA size in bytes: 61170132
Sep 30 14:03:08 ... writing Suffix Array to disk ...
Writing 61170132 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SA ; empty space on disk = 371713635254272 bytes ... done
Sep 30 14:03:08 ... writing SAindex to disk
Writing 8 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SAindex ; empty space on disk = 371697976868864 bytes ... done
Writing 48 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SAindex ; empty space on disk = 371697976868864 bytes ... done
Writing 5971 bytes into /Users/ekomp/Documents/repos/RNA-seq/data/genome//SAindex ; empty space on disk = 371697976868864 bytes ... done
Sep 30 14:03:08 ..... finished successfully
DONE: Genome generation, EXITING

I have beat my head against a couple of parameters and cannot get the genome to come out with more that ~80 transcripts. Thanks for your time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant