[Question] How to determine the number of splits used for Diamond annotation? #88

Open
jolespin opened this issue Nov 30, 2023 · 1 comment


Let's say someone ran the following command:

    checkm2 predict --genes --threads 16 --database_path uniref100.KO.1.dmnd --output_directory checkm2_output --input genomes/ -x faa

In checkm2_output/diamond_output there are several DIAMOND output files. Is there a way I can compute the expected number of DIAMOND output files based on the number of genomes and/or the number of genes in each FASTA file?


jolespin commented Dec 1, 2023

I'm seeing this bit here:

    def run(self, protein_files):

        
        logging.info('Annotating input genomes with DIAMOND using {} threads'.format(self.threads))
        
        if len(protein_files) <= self.chunksize:
            protein_chunks = self.__concatenate_proteins(protein_files)
            diamond_out = os.path.join(self.diamond_out, "DIAMOND_RESULTS.tsv")
            self.__call_diamond(protein_chunks, diamond_out)
                        
        else:
            #break file list into chunks of size 'chunksize'
            chunk_list = [protein_files[i:i + self.chunksize] for i in range(0, len(protein_files), self.chunksize)]
            
            for number, chunk in enumerate(chunk_list):
                diamond_out = os.path.join(self.diamond_out, "DIAMOND_RESULTS_{}.tsv".format(number))
                self.__call_diamond(self.__concatenate_proteins(chunk), diamond_out)

        
        diamond_out_list = [x for x in os.listdir(self.diamond_out) if x.startswith('DIAMOND_RESULTS')]
        if len(diamond_out_list) == 0:
            logging.error("Error: DIAMOND failed to generate output.")
            sys.exit(1)
        else:
            return diamond_out_list
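Reading that logic, the number of output files seems to depend only on how many protein files are passed in and on `self.chunksize`, not on the number of genes in each file. Here is a minimal sketch of that arithmetic. The function name `expected_diamond_outputs` is hypothetical (it is not part of CheckM2), and the chunk size has to be supplied by the caller, since the value of `self.chunksize` is not shown in the snippet above.

```python
import math


def expected_diamond_outputs(n_protein_files, chunksize):
    """Predict the DIAMOND result file names, mirroring the branching in run().

    Hypothetical helper, not part of CheckM2. `chunksize` must match whatever
    value self.chunksize holds in the installed version.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")

    # Same test as run(): everything fits in one chunk, so one unnumbered file.
    if n_protein_files <= chunksize:
        return ["DIAMOND_RESULTS.tsv"]

    # Otherwise one numbered file per chunk; the last chunk may be partial.
    n_chunks = math.ceil(n_protein_files / chunksize)
    return ["DIAMOND_RESULTS_{}.tsv".format(i) for i in range(n_chunks)]


if __name__ == "__main__":
    # 250 is only a placeholder chunk size for illustration.
    print(len(expected_diamond_outputs(1000, 250)))
```

So with more input files than the chunk size, the expected count is `ceil(number_of_files / chunksize)`, and otherwise it is a single `DIAMOND_RESULTS.tsv`.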

It could be REALLY useful to have a progress bar over the chunks, something like this:

    from tqdm import tqdm
    def run(self, protein_files):

        
        logging.info('Annotating input genomes with DIAMOND using {} threads'.format(self.threads))
        
        if len(protein_files) <= self.chunksize:
            protein_chunks = self.__concatenate_proteins(protein_files)
            diamond_out = os.path.join(self.diamond_out, "DIAMOND_RESULTS.tsv")
            self.__call_diamond(protein_chunks, diamond_out)
                        
        else:
            #break file list into chunks of size 'chunksize'
            chunk_list = [protein_files[i:i + self.chunksize] for i in range(0, len(protein_files), self.chunksize)]
            
            for number, chunk in tqdm(enumerate(chunk_list), desc="Diamond annotation on chunks", total=len(chunk_list)):
                diamond_out = os.path.join(self.diamond_out, "DIAMOND_RESULTS_{}.tsv".format(number))
                self.__call_diamond(self.__concatenate_proteins(chunk), diamond_out)

        
        diamond_out_list = [x for x in os.listdir(self.diamond_out) if x.startswith('DIAMOND_RESULTS')]
        if len(diamond_out_list) == 0:
            logging.error("Error: DIAMOND failed to generate output.")
            sys.exit(1)
        else:
            return diamond_out_list
