New genome and annotations for Chamaecrista fasciculata (two haplotypes) #208

StevenCannon-USDA · 2024-06-28T18:32:39Z

StevenCannon-USDA · 2024-07-15T21:17:08Z

This one is back in play, following our discussion about handling haplotype-resolved assemblies.

adf-ncgr · 2024-07-18T15:10:43Z

@StevenCannon-USDA I should have the AHRDs on these two completed soon and will then move them from the annex to the main datastore. My preference would be to move them both there, since it seems like it would make sense to include them both in at least some (if not all) downstream systems. But I wanted to confirm with you, since I think you were originally planning to leave secondary haplotypes in the annex. Also, one very minor note: the procedure you're using for the upstream processing seems to be producing uncompressed gff3 for the gene_models_main files, although they have the .gz suffix. Not really a problem, since we have to add the AHRD stuff in and redo compression/indexing anyway, but it is a bit confusing when gunzip complains...

StevenCannon-USDA · 2024-07-18T15:36:12Z

"move them both there since it seems like it would make sense to include them both in at least some (if not all) downstream systems."

I agree now that moving them both to the main Data Store is best.

Thanks for the alert about the uncompressed GFF3s. I suspect that was due to some additional manual steps I took when the automated compression failed, I think because of an interrupted session.

adf-ncgr · 2024-07-18T16:57:23Z

OK, the data-content-related tasks (AHRD/BUSCO/gfa) should be complete and I've moved the folders into the main datastore. Downstream steps will proceed as time permits, but if there are any you consider higher priority than others, let me know.

Regarding the compression, it definitely was an issue on both haplotypes, and I feel like I've seen it before, but I'm not 100% sure about that. In any case, if I see it again I'll let you know.

StevenCannon-USDA · 2024-07-18T17:01:19Z

OK, thank you.

I'll also investigate the compression issue -- at least next time I run the process.
The script responsible should be
/usr/local/www/data/datastore-specifications/scripts/compress_and_index.sh
and the code in question is:

for file in $filepath/*.f?a $filepath/*.gff3 $filepath/*tsv $filepath/*bed; do
  if test -f $file; then
    echo "Compressing $file"
    bgzip -l9 $file &
  fi
done
wait

adf-ncgr · 2024-07-18T17:35:55Z

Well, that looks pretty straightforward, but now that I think about it some more, I don't think an interrupted session would explain the observed behavior, which is as if the original file were simply renamed with a .gz suffix. Is it possible that there's something else that just names it with a gz extension (in which case the code above wouldn't even see it there)?
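
One quick way to tell a renamed file from a truly compressed one is to look at its first two bytes: gzip data (including the BGZF output of bgzip, which is gzip-compatible) starts with 1f 8b. The sketch below assumes standard `head`, `od`, and `tr`; the function name `is_gzipped` is hypothetical and not part of any existing script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the file starts with the gzip magic
# bytes (1f 8b), and 1 otherwise (including for an empty file).
is_gzipped() {
  local magic
  # Read the first two bytes and render them as a bare hex string.
  magic=$(head -c 2 -- "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}
```

For example, `is_gzipped gene_models_main.gff3.gz || echo "has a .gz suffix but is not compressed"` would flag a file that was only renamed.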

StevenCannon-USDA · 2024-07-19T14:04:21Z

"Is it possible that there's something else that just names it with a gz extension"

Helpful suggestion/clue. You are right.
Here's the source of the problem. In the ds_souschef.pl configs chafa.ISC494698.gnm1_hap2.ann1.yml and chafa.ISC494698.gnm1.ann1.yml, the "to" suffix was given as gene_models_main.gff3.gz, but it should have been just gene_models_main.gff3, since the output is not gzipped by ds_souschef.pl.

  - 
    from: gene_strip.gff3.gz
    to: gene_models_main.gff3.gz
    description: "Gene models - main"

I'll plan to add checks for this in ds_souschef.pl once I've finished some other tasks.
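
Until such a check exists in ds_souschef.pl, a collection directory could be screened after the fact. The following is a minimal sketch, not the planned implementation: it assumes that `gzip -t` is an acceptable validity test (it also accepts bgzip output), and the function name `find_mislabeled_gz` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every *.gz file directly inside the given
# directory that is not valid gzip data. Returns 1 if any such file was
# found, 0 otherwise.
find_mislabeled_gz() {
  local dir=$1 status=0 file
  for file in "$dir"/*.gz; do
    # Skip the unexpanded pattern when the directory has no .gz files.
    [ -f "$file" ] || continue
    if ! gzip -t "$file" 2>/dev/null; then
      echo "$file"
      status=1
    fi
  done
  return "$status"
}
```

Running `find_mislabeled_gz "$collection_dir"` (with `$collection_dir` set to the collection being checked) lists any file that carries the suffix without being compressed, so it can be renamed and passed through bgzip before indexing.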

StevenCannon-USDA assigned adf-ncgr and StevenCannon-USDA Jun 28, 2024

StevenCannon-USDA commented Jun 28, 2024 • edited by adf-ncgr

Main steps for adding new genome and annotation collections

Genus/species/collection names:
