From 202be4fdb357666c613a10f145969971d86ec850 Mon Sep 17 00:00:00 2001
From: ErinWeisbart <54687786+ErinWeisbart@users.noreply.github.com>
Date: Tue, 6 Aug 2024 10:00:20 -0700
Subject: [PATCH] clarify language

---
 .../DCP-documentation/troubleshooting_runs.md | 26 +++++++++----------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/documentation/DCP-documentation/troubleshooting_runs.md b/documentation/DCP-documentation/troubleshooting_runs.md
index 77589e9..f12340c 100644
--- a/documentation/DCP-documentation/troubleshooting_runs.md
+++ b/documentation/DCP-documentation/troubleshooting_runs.md
@@ -2,22 +2,22 @@
 | SQS | Cloudwatch | S3 | EC2/ECS | Problem | Solution |
 |---|---|---|---|---|---|
-| Messages in flight consistently < number of dockers running | CP never progresses beyond a certain module| | | CP is stalling indefinitely on a step without throwing an error. This means there is a bug in CP. | The module that is stalling is the one after the last module that got logged. Check the Issues in the CP Github repo for reports of problems with a certain module. If you don’t see a report, make one. Use different settings within the module to avoid the bug or use a different version of DCP with the bug fixed. |
-| Jobs completing (total messages decreasing) much more quickly than expected. | "File not run due to > expected number of files"| | | CHECK_IF_DONE_BOOL is being triggered because the output folder for your job already has >= EXPECTED_NUMBER_OF_FILES. | If you want to overwrite previous runs, in your config, change CHECK_IF_DONE_BOOL to TRUE. If using the CHECK_IF_DONE_BOOL option to avoid reprocessing old jobs, make sure to account for any files that may already exist in the output folder. i.e. if your pipeline creates 5 files, but there are already 6 files in your output folder, make sure to set the EXPECTED_NUMBER_FILES to 11 (6+5), not 5.|
-| Jobs completing (total messages decreasing) much more quickly than expected. | “== OUT” without proceeding through CP pipeline | Batch_data.h5 files being created instead of expected output. | | Your pipeline has the CreateBatchFiles module included. | Uncheck the CreateBatchFiles module in your pipeline. |
+| Messages in flight consistently < number of dockers running | CP never progresses beyond a certain module | | | CP is stalling indefinitely on a step without throwing an error. This means there is a bug in CP. | The module that is stalling is the one after the last module that got logged. Check the Issues in the CP Github repo for reports of problems with a certain module. If you don’t see a report, make one. Use different settings within the module to avoid the bug or use a different version of DCP with the bug fixed. |
+| Jobs completing (total messages decreasing) much more quickly than expected. | "File not run due to > expected number of files" | | | CHECK_IF_DONE_BOOL is being triggered because the output folder for your job already has >= EXPECTED_NUMBER_OF_FILES. | If you want to overwrite previous runs, in your config, change CHECK_IF_DONE_BOOL to TRUE. If using the CHECK_IF_DONE_BOOL option to avoid reprocessing old jobs, make sure to account for any files that may already exist in the output folder. i.e. if your pipeline creates 5 files, but there are already 6 files in your output folder, make sure to set the EXPECTED_NUMBER_FILES to 11 (6+5), not 5. |
+| Jobs completing (total messages decreasing) much more quickly than expected. | “== OUT” without proceeding through CP pipeline | Batch_data.h5 files being created instead of expected output. | | Your pipeline has the CreateBatchFiles module included. | Uncheck the CreateBatchFiles module in your pipeline. |
 | | "ValueError: dictionary update sequence element #1 has length 1; 2 is required" | | | The syntax in the groups section of your job file is incorrect. | If you are grouping based on multiple variables, make sure there are no spaces between them in your listing in your job file. e.g. "Metadata_Plate=Plate1,Metadata_Well=A01" is correct, "Metadata_Plate=Plate1, Metadata_Well=A01" is incorrect. |
-| | Nothing happens for a long time after "cellprofiler -c -r "| | | 1) Your input directory is set to a folder with a large number of files and CP is trying to read the whole directory before running. 2) You are loading very large images. | 1) In your job file, change the input to a smaller folder. 2) Consider downscaling your images before running them in CP. Or just be more patient.|
-| | Within a single log there are multiple “cellprofiler -c -r” | Expected output seen. | | A single job is being processed multiple times. | SQS_MESSAGE_VISIBILITY is set too short. See [SQS_Queue_information](SQS_QUEUE_information.md) for more information. |
+| | Nothing happens for a long time after "cellprofiler -c -r " | | | 1) Your input directory is set to a folder with a large number of files and CP is trying to read the whole directory before running. 2) You are loading very large images. | 1) In your job file, change the input to a smaller folder. 2) Consider downscaling your images before running them in CP. Or just be more patient. |
+| | Within a single log there are multiple “cellprofiler -c -r” | Expected output seen. | | A single job is being processed multiple times. | SQS_MESSAGE_VISIBILITY is set too short. See [SQS_Queue_information](SQS_QUEUE_information.md) for more information. |
 | | “ValueError: no name (Invalid arguments to routine: Bad value)” or “Encountered unrecoverable error in LoadData during startup: No name (no name)” | | | There is a problem with your LoadData.csv. This is usually seen when CSVs are created with a script; accidentally having an extra comma somewhere (looks like ",,") will be invisible in Excel but generate the CP error. If you made your CSVs with pandas to_csv option, you must pass index=False or you will get this error. | Find the “,,” in your CSV and remove it. If you made your CSVs with pandas dataframe’s to_csv function, check to make sure you used the index=False parameter. |
-| | IndexError: index 0 is out of bounds for axis 0 with size 0| | | 1) Metadata values of 0 OR that have leading zeros (ie Metadata_Site=04, rather than Metadata_Site=4) are not handled well by CP. 2) The submitted jobs don’t make sense to CP. 3) DCP is looking for your images in the wrong location. | 1) Change your LoadData.csv so that there are no Metadata values of 0 or with 0 padding. 2) Change your job file so that your jobs match your pipeline’s expected input. 3) If using LoadData, make sure the file paths are correct in your LoadData.csv and the "Base image location" is set correctly in the LoadData module. If using BatchFiles, make sure your BatchFile paths are correct. |
-| | | Pipeline output is not where expected | | 1) There is a mistake in your ExportToSpreadsheet in your pipeline. 2) There is a mistake in your job file. | 1) Check that your Output File Location is as expected. Default Output Folder is typical. Default Output Folder sub-folder can cause outputs to be nested in an unusual manner. 2) Check the output path in your job file. |
-| | "Empty image set list: no images passed the filtering criteria." | | |DCP doesn’t know how to load your image set.| If you are using a .cppipe and LoadData.csv, make sure that your pipeline includes the LoadData module. |
-| Jobs completing(total messages decreasing) much more quickly than expected. |"==OUT, SUCCESS"| No outcome/saved files on s3 | | There is a mismatch in your metadata somewhere. |Check the Metadata_ columns in your LoadData.csv for typos or a mismatch with your jobs file. The most common sources of mismatch are case and zero padding (e.g. A01 vs a01 vs A1). Check for these mismatches and edit the job file accordingly. If you use pe2loaddata to create your csvs and the plate was imaged multiple times, pay particular attention to the Metadata_Plate column as numbering reflecting this will be automatically passed into the Load_data.csv |
-| | Your specified output structure does not match the Metadata passed. |Expected output is seen.| | This is not necessarily an error. If the input grouping is different than the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder) then this will print in the Cloudwatch log that matches the input structure but actual job progress will print in the Cloudwatch log that matches the output structure. | |
-| | Your perinstance logs have an IOError indicating that an .h5 batchfile does not exist | No outcome/saved files on s3 | | No batchfiles exist for your project. | Either you need to create the batch files and make sure that they are in the appropriate directory OR re-start and use MakeAnalysisJobs() instead of MakeAnalysisJobs(mode=‘batch’) in run_batch_general.py |
-| | | | Machines made in EC2 and dockers are made in ECS but the dockers are not placed on the machines | 1) There is a mismatch in your DCP config file. OR 2) You haven't set up permissions correctly. | 1) Confirm that the MEMORY matches the MACHINE_TYPE set in your config. Confirm that there are no typos in your DOCKERHUB_TAG set in your config. 2) Check that you have set up permissons correctly for the user or role that you have set in your config under AWS_PROFILE. Confirm that your `ecsInstanceRole` is able to access the S3 bucket where your `ecsconfigs` have been uploaded. |
+| | IndexError: index 0 is out of bounds for axis 0 with size 0 | | | 1) Metadata values of 0 OR that have leading zeros (ie Metadata_Site=04, rather than Metadata_Site=4) are not handled well by CP. 2) The submitted jobs don’t make sense to CP. 3) DCP is looking for your images in the wrong location. | 1) Change your LoadData.csv so that there are no Metadata values of 0 or with 0 padding. 2) Change your job file so that your jobs match your pipeline’s expected input. 3) If using LoadData, make sure the file paths are correct in your LoadData.csv and the "Base image location" is set correctly in the LoadData module. If using BatchFiles, make sure your BatchFile paths are correct. |
+| | | Pipeline output is not where expected | | 1) There is a mistake in your ExportToSpreadsheet in your pipeline. 2) There is a mistake in your job file. | 1) Check that your Output File Location is as expected. Default Output Folder is typical. Default Output Folder sub-folder can cause outputs to be nested in an unusual manner. 2) Check the output path in your job file. |
+| | "Empty image set list: no images passed the filtering criteria." | | | DCP doesn’t know how to load your image set.| If you are using a .cppipe and LoadData.csv, make sure that your pipeline includes the LoadData module. |
+| Jobs completing(total messages decreasing) much more quickly than expected. |"==OUT, SUCCESS"| No outcome/saved files on s3 | | There is a mismatch in your metadata somewhere. | Check the `Metadata_` columns in your LoadData.csv for typos or a mismatch with your jobs file. The most common sources of mismatch are case and zero padding (e.g. A01 vs a01 vs A1). Check for these mismatches and edit the job file accordingly. If you use pe2loaddata to create your csvs and the plate was imaged multiple times, pay particular attention to the Metadata_Plate column as numbering reflecting this will be automatically passed into the Load_data.csv |
+| | Your specified output structure does not match the Metadata passed. | Expected output is seen.| | This is not necessarily an error. If the input grouping is different than the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder) then this will print in the Cloudwatch log that matches the input structure but actual job progress will print in the Cloudwatch log that matches the output structure. | |
+| | Your perinstance logs have an IOError indicating that an .h5 batchfile does not exist | No outcome/saved files on s3 | | No batchfiles exist for your project. | Either you need to create the batch files and make sure that they are in the appropriate directory OR re-start and use MakeAnalysisJobs() instead of MakeAnalysisJobs(mode=‘batch’) in run_batch_general.py |
+| | | | Machines made in EC2 but they remain nameless. | A nameless machine means that the Dockers are not placed on the machines. 1) There is a mismatch in your DCP config file. OR 2) You haven't set up permissions correctly. OR 3) Dockers are not being made in ECS | 1) Confirm that the MEMORY matches the MACHINE_TYPE set in your config. Confirm that there are no typos in your DOCKERHUB_TAG set in your config. 2) Check that you have set up permissions correctly for the user or role that you have set in your config under AWS_PROFILE. Confirm that your `ecsInstanceRole` is able to access the S3 bucket where your `ecsconfigs` have been uploaded. 3) Check in ECS that you see `Registered container instances`. |
 | | Your perinstance logs have an IOError indicating that CellProfiler cannot open your pipeline | | | You have a corrupted pipeline. | Check if you can open your pipeline locally. It may have been corrupted on upload or it may have an error within the pipeline itself. |
-| |"== ERR move failed:An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate." Error may not show initially and may become more prevalent with time. | | | Too many jobs are finishing too quickly creating a backlog of jobs waiting to upload to S3. | You can 1) check out fewer machines at a time, 2) check out smaller machines and run fewer copies of DCP at the same time, or 3) group jobs in larger groupings (e.g. by Plate instead of Well or Site). If this happens because you have many jobs finishing at the same time (but not finishing very rapidly such that it's not creating an increasing backlog) you can increase SECONDS_TO_START in config.py so there is more separation between jobs finishing.|
+| | "== ERR move failed:An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate." Error may not show initially and may become more prevalent with time. | | | Too many jobs are finishing too quickly creating a backlog of jobs waiting to upload to S3. | You can 1) check out fewer machines at a time, 2) check out smaller machines and run fewer copies of DCP at the same time, or 3) group jobs in larger groupings (e.g. by Plate instead of Well or Site). If this happens because you have many jobs finishing at the same time (but not finishing very rapidly such that it's not creating an increasing backlog) you can increase SECONDS_TO_START in config.py so there is more separation between jobs finishing. |
 | | "/home/ubuntu/bucket: Transport endpoint is not connected" | Cannot be accessed by fleet. | | S3FS has stochastically dropped/failed to connect. | Perform your run without using S3FS by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded. |
 | | "SSL: certificate subject name (*.s3.amazonaws.com) does not match target host name 'xxx.yyy.s3.amazonaws.com'" | Cannot be accessed by fleet. | | S3FS fails to mount if your bucket name has a dot (.) in it. | You can bypass S3FS usage by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded. Alternatively, you can make your own DCP Docker and edit run-worker.sh to `use_path_request_style`. If your region is not us-east-1 you also need to specify `endpoint`. See S3FS documentation for more information. |
 | | Your logs show that files are downloading but it never moves beyond that point. | | | If you have set DOWNLOAD_FILES = TRUE in your config, then your files are failing to completely download because you are running out of space and it is failing silently. | Place larger volumes on your instances by increasing EBS_VOL_SIZE in your config.py |
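
The LoadData.csv row above says a stray ",," is invisible in Excel, so a small script can be easier than inspecting the file by eye. The following is a minimal sketch, not part of DCP: `find_empty_fields` is a hypothetical helper, and the column names and file names in the example are made up for illustration. It reports every blank field by row and column, counting the header as row 1, so a result of (1, 1) points at a blank first header name (as pandas `to_csv` writes for an unnamed index when `index=False` is not passed).

```python
import csv
import io


def find_empty_fields(csv_text):
    """Return (row_number, column_number) pairs for every blank field.

    Both numbers are 1-based and the header counts as row 1.
    """
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column_number, value in enumerate(row, start=1):
            if value.strip() == "":
                problems.append((row_number, column_number))
    return problems


# Hypothetical LoadData.csv contents with a stray ",," in the last row.
example = (
    "FileName_OrigDNA,Metadata_Plate,Metadata_Well\n"
    "img1.tif,Plate1,A01\n"
    "img2.tif,,A02\n"
)
print(find_empty_fields(example))
```

To check a real file, read its text first (for example with `open(path, newline="").read()`) and pass that string to the helper. Whether a given blank field is an error depends on the pipeline, so treat the output as a list of places to inspect rather than a verdict.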