Address species that have too long filenames because of accidentally using the long gene name form #21

Open
johrstrom opened this issue Nov 8, 2021 · 2 comments

johrstrom commented Nov 8, 2021

Attempting to generate a tarball of all the data failed with the following error:

/usr/local/ruby/2.5.5/lib/ruby/2.5.0/rubygems/package/tar_writer.rb:330:in `split_name': File "phylogatr-results/Bacillariophyceae/Bacillariales/Bacillariaceae/Pseudo-nitzschia-mannii/Pseudo-nitzschia-mannii-RIBULOSE-1,5-BISPHOSPHATE-CARBOXYLASE-OXYGENASE-LARGE-SUBUNIT-(RBCL)-GENE.afa" has a too long name (should be 100 or less) (Gem::Package::TooLongFileName)

The problem is the file name Pseudo-nitzschia-mannii-RIBULOSE-1,5-BISPHOSPHATE-CARBOXYLASE-OXYGENASE-LARGE-SUBUNIT-(RBCL)-GENE.afa, which uses the long form of the gene name (RIBULOSE-1,5-BISPHOSPHATE-CARBOXYLASE-OXYGENASE-LARGE-SUBUNIT-(RBCL)-GENE) instead of the short form. The short form is probably RBCL.
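
As a rough way to find other affected files, the check below flags any path whose final component is longer than 100 bytes (the limit named in the error message) and guesses a short gene name from a parenthesized symbol. This is only a sketch: `too_long_for_tar?`, `guess_short_gene_name`, and the parenthesis heuristic are hypothetical and not part of the project, and the check looks only at the file name, which is a simplification of what `split_name` does.

```ruby
# Hypothetical helpers for illustration; not part of the project code.

# Limit quoted in the Gem::Package::TooLongFileName message above.
TAR_NAME_LIMIT = 100

# True when the last path component exceeds the limit. This is a
# simplification: it only looks at the file name, not the directory prefix.
def too_long_for_tar?(path, limit: TAR_NAME_LIMIT)
  File.basename(path).bytesize > limit
end

# Assumption: the short form appears in parentheses inside the long form,
# as in "...-LARGE-SUBUNIT-(RBCL)-GENE". Returns nil when there is none.
def guess_short_gene_name(long_name)
  match = long_name.match(/\(([A-Za-z0-9]+)\)/)
  match && match[1]
end

path = "phylogatr-results/Bacillariophyceae/Bacillariales/Bacillariaceae/" \
       "Pseudo-nitzschia-mannii/Pseudo-nitzschia-mannii-RIBULOSE-1,5-" \
       "BISPHOSPHATE-CARBOXYLASE-OXYGENASE-LARGE-SUBUNIT-(RBCL)-GENE.afa"

puts too_long_for_tar?(path)
puts guess_short_gene_name(File.basename(path, ".afa"))
```

Any real fix would likely belong wherever gene names are normalized before the file is written, so the long form never reaches the file system in the first place.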

@johrstrom

This particular file is still problematic.

The actual file contents are tiny, so here they are in full. The sequence headers show the source accession and GBIF record IDs, which should help in fixing the problem:

>LR594647_2286502301
gtttatctggtaaaaactacggtncgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcntcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtcaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaancatgtggtccattacaaacagctttagatttatggaaagatattagttttanctacacatctacagatacagctgatttcgctgtaacaccaactgcaaac
>LR594648_2286508931
tcaatgtctcaatctgtatcagaacggactcgaattaaaagtgaccgttacgaatctggtgtaatcccttatgctaaaatgggttactgggatgcttcatacgcagtaaaaactactgatgttcttgctttattccgtatcacaccacaaccaggcgtagatcctgtagaagctgctgctgcagtagctggtgaatcttcaacagcaacttggacagttgtatggactgatttattaacagcttgtgaccgttaccgtgctaaagcttaccgtgtagatccagttccaaacacatcagatgtattctttgctttcatcgcatacgaatgtgatttatttgaagaaggttctttatcgaacttaacagcatctatcattggtaacgtattcggttttaaagctgtatcagctttacgtttagaagacatgcgtattcctcactcatacttaaaaacattccaaggtcctgcnacaggtatcgttgtagaacgtgaacgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcgtcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtaaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaaacagct
>LR594649_2286502346
gttgtagaacgtgaacgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcgtcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtaaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaa
>LR594651_2286502439
tctgtatcagaacggactcgaattaaaagtgaccgttacgaatctggtgtaatcccttatgctaaaatgggttactgggatgcttcatacgcagtaaaaactactgatgttcttgctttattccgtatcacaccacaaccaggcgtagatcctgtagaagctgctgctgcagtagctggtgaatcttcaacagcaacttggacagttgtatggactgatttattaacagcttgtgaccgttaccgtgctaaagcttaccgtgtagatccagttccaaacacatcagatgtattctttgctttcatcgcatacgaatgtgatttatttgaagaaggttctttatcgaacttaacagcatctatcattggtaacgtattcggttttaaagctgtatcagctttacgtttagaagacatgcgtattcctcactcatacttaaaaacattccaaggtcctgcaacaggtatcgttgtagaacgtgaacgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcgtcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtaaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaaacagcttta
>LR594652_2286512821
caatctgtatcagaacggactcgaattaaaagtgaccgttacgaatctggtgtaatcccttatgctaaaatgggttactgggatgcttcatacgcagtaaaaactactgatgttcttgctttattccgtatcacaccacaaccaggcgtagatcctgtagaagctgctgctgcagtagctggtgaatcttcaacagcaacttggacagttgtatggactgatttattaacagcttgtgaccgttaccgtgctaaagcttaccgtgtagatccagttccaaacacatcagatgtattctttgctttcatcgcatacgaatgtgatttatttgaagaaggttctttatcgaacttaacagcatctatcattggtaacgtattcggttttaaagctgtatcagctttacgtttagaagacatgcgtattcctcactcatacttaaaaacattccaaggtcctgcaacaggtatcgttgtagaacgtgaacgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcgtcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtaaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaaacagctttagatttatggaaagatattagttttaactacacatctacagataca
>LR594653_2286502078
cgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcgtcgttatgatcgatttagttatgggttacactgcaatccaaagtactgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatacggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtaaagactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaaacagct
>LR594654_2286510307
acaggtatcgttgtagaacgtgaacgtttaaataaatatggtactcctttattaggtgctacagtaaaaccaaaattaggtttatctggtaaaaactacggtcgtgtagtattcgaaggtttaaaaggtggtttagacttcttaaaagatgatgaaaacattaactctcaaccattcatgcgttggagagagcgtttcttaaactgtatggaaggtattaaccgtgcatctgctgctacaggagaagtaaaaggttcttacttaaacgttacagctgcaactatggaagaagttatcaaacgttgtgaatatgctaaagaagtaggttctatcatcgttatgatcgatttagttatgggttacactgcaatccaaagtgctgcatactgggctcgtgaaaacgatatgcttttacatttacaccgtgctggtaactctacatacgcacgtcaaaagaaccacggtattaacttccgtgtaatctgtaaatggatgcgtatgtctggtgtagatcatatccacgctggaacagttgtaggtaaattagaaggtgatcctttaatgattaaaggtttctacgatgttttacgtttaactcacttagaagttaacttaccatttggtattttcttcgaaatgccttgggctagtttacgtcgttgtatgccggtagcatctggtggtattcactgtggtcaaatgcaccaattaattcactacttaggtgatgatgtagtattacaatttggtggtggtacaatcggtcacccagatggtattcaagccggtgctacagctaaccgtgttgctttagaagcaatggtattagctcgtaacgaaggtcacgactacttcagtccagaagttggtccacaaatcttacgtaatgccgctaaaacatgtggtccattacaaacagct
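
To pull the accession and GBIF record ID out of headers like the ones above, something along these lines could work. It is a sketch that assumes each header has the form `>ACCESSION_GBIFID`, with the ID being the digits after the last underscore. `parse_header` is a hypothetical name, not a function from the project.

```ruby
# Hypothetical helper for illustration; not part of the project code.
# Assumes a header shaped like ">LR594647_2286502301": the accession,
# an underscore, then a numeric GBIF record ID.
def parse_header(line)
  # The greedy ".+" splits on the LAST underscore, so accessions that
  # themselves contain underscores are still handled.
  match = line.match(/\A>(?<accession>.+)_(?<gbif_id>\d+)\s*\z/)
  return nil unless match

  { accession: match[:accession], gbif_id: match[:gbif_id] }
end

headers = [">LR594647_2286502301", ">LR594648_2286508931"]
headers.each do |header|
  record = parse_header(header)
  puts "#{record[:accession]} #{record[:gbif_id]}"
end
```

Lines that are not headers in this form, such as the sequence lines, return `nil`.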

@johrstrom

This won't affect production, though: the file is not aligned, so the species is ignored:

 #<Species:0x0000558427576510
  id: 77455,
  path: "Bacillariophyceae/Bacillariales/Bacillariaceae/Pseudo-nitzschia-mannii",
  total_seqs: 15,
  total_bytes: 18744,
  aligned: false,
  taxon_kingdom: "Chromista",
  taxon_phylum: "Ochrophyta",
  taxon_class: "Bacillariophyceae",
  taxon_order: "Bacillariales",
  taxon_family: "Bacillariaceae",
  taxon_genus: "Pseudo-nitzschia",
  taxon_species: "Pseudo-nitzschia mannii",
  taxon_subspecies: "",
  different_genbank_species: nil>]
