Failure running my ML workflows #1115

Open
kxk302 opened this issue May 13, 2021 · 22 comments

@kxk302 (Contributor) commented May 13, 2021

I have 3 workflows that use Galaxy's ML tools (namely Keras for neural networks). They all worked fine last time I ran them (maybe a month ago?).

These 3 workflows are used in 3 neural network tutorials that I am presenting at GCC 2021, so I decided to re-run them to make sure everything still works. All 3 workflows now fail. Here is the error message for the first 2 workflows:

Traceback (most recent call last):
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 491, in <module>
    targets=args.targets, fasta_path=args.fasta_path)
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 405, in main
    estimator.fit(X_train, y_train)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 911, in fit
    return super(KerasGRegressor, self)._fit(X, y, **kwargs)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 644, in _fit
    validation_data = self.validation_data

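The traceback above is cut off before the exception message, but the last frame dies while reading self.validation_data in _fit. One plausible cause (an assumption on my part, since the exception type is missing) is an instance attribute that the newer galaxy_ml version never sets. A tiny stand-alone sketch of that failure mode, using a stand-in class rather than the real KerasGRegressor:

  # Stand-in class (not the real KerasGRegressor) showing how reading an
  # attribute that was never assigned raises AttributeError at exactly the
  # kind of line the traceback stops on.
  class DummyEstimator:
      def _fit(self):
          return self.validation_data  # fails if never assigned

  est = DummyEstimator()
  try:
      est._fit()
  except AttributeError as err:
      print("reproduced:", err)

  # Defensive variant some estimators use instead:
  print("guarded:", getattr(est, "validation_data", None))
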
Here are the histories:

  1. https://usegalaxy.eu/u/kaivan/h/dlfnn
  2. https://usegalaxy.eu/u/kaivan/h/dlrnn
  3. https://usegalaxy.eu/u/kaivan/h/dlcnn

Per @anuprulez's suggestion, I downgraded the tool versions, and the first and second workflows now work. Here are the downgrades:

  1. Create a deep learning model architecture: downgraded to 0.4.2
  2. Create a deep learning model with an optimizer, loss function and fit parameters: downgraded to 0.4.2
  3. Deep learning training and evaluation conduct deep training and evaluation either implicitly or explicitly: downgraded to 1.0.8.2

The third workflow still fails. Note that it requires the most recent version of the third tool.
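
In case it helps others verify which tool versions their copies of these workflows pin (and whether a downgrade actually took effect), here is a rough BioBlend sketch; the API key is a placeholder and the exact fields returned can vary between Galaxy releases:

  # Rough sketch: list the tools and versions referenced by each workflow on
  # usegalaxy.eu via BioBlend. Replace YOUR_API_KEY with a real key.
  from bioblend.galaxy import GalaxyInstance

  gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

  for wf in gi.workflows.get_workflows():
      details = gi.workflows.show_workflow(wf["id"])
      print(details["name"])
      for step in details["steps"].values():
          if step.get("tool_id"):
              print("  ", step["tool_id"], step.get("tool_version"))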

I started writing unit tests in galaxytools (https://github.com/kxk302/galaxytools/tree/nn_tests) so that these workflows are run as part of the unit tests. They would serve as regression tests and ensure that future changes do not break existing code. However, I ran into another issue: models saved to a file cannot be loaded back and error out. I am not sure if this is related to the workflow error above. Here is the error message:

unzip cnn.zip
Archive: cnn.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of cnn.zip or
cnn.zip.zip, and cannot find cnn.zip.ZIP, period.
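
For the unit tests, a cheap pre-check along these lines (plain standard library, nothing Galaxy-ML-specific) would at least distinguish a truncated or mis-typed download from a genuine loader bug:

  # Check that the saved model artifact really is a readable zip archive
  # before handing it to the model loader. "cnn.zip" matches the file above.
  import zipfile

  path = "cnn.zip"
  if not zipfile.is_zipfile(path):
      raise ValueError(f"{path} is not a valid zip archive (truncated or wrong datatype?)")

  with zipfile.ZipFile(path) as archive:
      print("members:", archive.namelist())
      print("first corrupt member:", archive.testzip())  # None if all OK
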
@kxk302 (Contributor, Author) commented May 13, 2021

@anuprulez I just ran my third workflow (CNN workflow) on galaxy.eu and it failed. Could you please check the log to see what error message we get? Thanks.

@kxk302 (Contributor, Author) commented May 13, 2021

I only see "Failed to communicate with remote job server."

@anuprulez (Contributor) commented:

@kxk302 In the first and third histories, I don't have permission to see the datasets. Can you unlock them?

@kxk302 (Contributor, Author) commented May 13, 2021

Update: I re-ran the third history after the initial failure and it completed successfully.

@anuprulez how do I unlock the datasets? I don't see an option when trying to share the history. If you want, we can use Gitter to resolve this. Thanks.

@anuprulez (Contributor) commented:

I see that some changes were made to https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml very recently.

@kxk302 (Contributor, Author) commented May 13, 2021

https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml

Yes, there was a bug fix in Galaxy-ML that was pushed recently.

@kxk302 (Contributor, Author) commented May 13, 2021

You need to rename the uploaded files and change their datatype to tabular before running the workflows. Thanks.
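
If it helps, the renaming and datatype step can also be done at upload time so the histories are reproducible; a rough BioBlend sketch (the history name, file names, and API key are placeholders):

  # Upload an input with the expected name and the tabular datatype already
  # set, instead of renaming and re-typing it by hand in the UI.
  from bioblend.galaxy import GalaxyInstance

  gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")
  history = gi.histories.create_history(name="nn-tutorial-inputs")

  gi.tools.upload_file(
      "X_train.tsv",
      history["id"],
      file_name="X_train",   # name the workflow expects
      file_type="tabular",   # bypass datatype auto-detection
  )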

@kxk302 (Contributor, Author) commented May 13, 2021

@anuprulez did you downgrade the tool versions in the RNN workflow?

@anuprulez (Contributor) commented May 13, 2021 via email

@kxk302 (Contributor, Author) commented May 13, 2021

If you downgrade the tool versions as I documented, it will work.

@kxk302 (Contributor, Author) commented May 13, 2021

I guess the question is why it stopped working with the new versions of those tools.

@qiagu (Contributor) commented May 13, 2021

Try checking the versions of the various packages in the conda environment, as well as the Python version (make sure it is Python 3.6). The conda environment includes many packages, and errors are easy to introduce when a newer package is added to the mix.
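
For example, something along these lines run with that environment's Python would surface the relevant versions (the package list is just a guess at the likely suspects):

  # Print the Python version and the versions of the packages most likely to
  # cause trouble in the tool's conda environment.
  import sys

  print("python:", sys.version)
  for name in ("tensorflow", "keras", "sklearn", "galaxy_ml", "numpy", "pandas"):
      try:
          module = __import__(name)
          print(name, getattr(module, "__version__", "unknown"))
      except ImportError:
          print(name, "not installed")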

@kxk302 (Contributor, Author) commented May 13, 2021

Thanks @qiagu,

Could you please provide more info on how to do that?

@qiagu (Contributor) commented May 13, 2021

Sorry, I was just describing a general debugging process, not anything specific to the issues mentioned in this thread. From the stderr report @anuprulez provided, I feel the errors could be cleared up by re-cleaning the input TSVs.

@qiagu (Contributor) commented May 13, 2021

Try to ensure the classification targets are integers, not floats.
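
A quick way to check and, if needed, coerce the labels (a generic pandas sketch; the file name and the assumption that the last column holds the label are placeholders):

  # Ensure the label column contains integer class labels, not floats.
  import pandas as pd

  targets = pd.read_csv("y_train.tsv", sep="\t")
  label_col = targets.columns[-1]            # assume last column is the label

  print("label dtype:", targets[label_col].dtype)
  if targets[label_col].dtype.kind == "f":   # float labels confuse classifiers
      targets[label_col] = targets[label_col].round().astype(int)
      targets.to_csv("y_train_clean.tsv", sep="\t", index=False)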

@kxk302 (Contributor, Author) commented May 13, 2021

I do not see the errors that Anup sees. I guess the first step would be to get these workflows working with older versions of the tools. Then we can use the new versions to reproduce the problem. @anuprulez I am not sure what your internet connectivity is like, but we could possibly have a Zoom meeting to discuss this tomorrow (Friday). I'm free from 8:00 am to 10:00 am EST.

@mvdbeek (Collaborator) commented May 13, 2021

I only see "Failed to communicate with remote job server."

That's a job-running error, not a tool error; you'll want to check it with Nate.

@kxk302 (Contributor, Author) commented May 13, 2021

I only see "Failed to communicate with remote job server."

That's a job-running error, not a tool error; you'll want to check it with Nate.

This was run on the EU instance. I vaguely remember Bjorn saying that some jobs are configured to run on GPU, that this error shows up in that case, and that it goes away when the job is run on CPU. Am I right, @bgruening?
