Failure running my ML workflows #1115

Open
kxk302 opened this issue May 13, 2021 · 22 comments

@kxk302 (Contributor) commented May 13, 2021

I have 3 workflows that use Galaxy's ML tools (namely Keras for neural networks). They all worked fine last time I ran them (maybe a month ago?).

These 3 workflows are used in 3 neural network tutorials that I am presenting at GCC 2021, so I decided to re-run them to make sure everything still works. All 3 workflows now fail. Here is the error message for the first 2 workflows:

Traceback (most recent call last):
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 491, in <module>
    targets=args.targets, fasta_path=args.fasta_path)
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 405, in main
    estimator.fit(X_train, y_train)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 911, in fit
    return super(KerasGRegressor, self)._fit(X, y, **kwargs)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 644, in _fit
    validation_data = self.validation_data

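The traceback above is cut off before the exception message, but the last frame dies while reading self.validation_data in _fit. One plausible cause (an assumption on my part, since the exception type is missing) is an instance attribute that the newer galaxy_ml version never sets. A tiny stand-alone sketch of that failure mode, using a stand-in class rather than the real KerasGRegressor:

  # Stand-in class (not the real KerasGRegressor) showing how reading an
  # attribute that was never assigned raises AttributeError at exactly the
  # kind of line the traceback stops on.
  class DummyEstimator:
      def _fit(self):
          return self.validation_data  # fails if never assigned

  est = DummyEstimator()
  try:
      est._fit()
  except AttributeError as err:
      print("reproduced:", err)

  # Defensive variant some estimators use instead:
  print("guarded:", getattr(est, "validation_data", None))
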
Here are the histories:

  1. https://usegalaxy.eu/u/kaivan/h/dlfnn
  2. https://usegalaxy.eu/u/kaivan/h/dlrnn
  3. https://usegalaxy.eu/u/kaivan/h/dlcnn

Per @anuprulez's suggestion, I downgraded the tool versions, and the first and second workflows now work. Here are the downgrades:

  1. Create a deep learning model architecture: downgraded to 0.4.2
  2. Create a deep learning model with an optimizer, loss function and fit parameters: downgraded to 0.4.2
  3. Deep learning training and evaluation conduct deep training and evaluation either implicitly or explicitly: downgraded to 1.0.8.2

The third workflow still fails. Note that it requires the most recent version of the third tool.
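
In case it helps others verify which tool versions their copies of these workflows pin (and whether a downgrade actually took effect), here is a rough BioBlend sketch; the API key is a placeholder and the exact fields returned can vary between Galaxy releases:

  # Rough sketch: list the tools and versions referenced by each workflow on
  # usegalaxy.eu via BioBlend. Replace YOUR_API_KEY with a real key.
  from bioblend.galaxy import GalaxyInstance

  gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

  for wf in gi.workflows.get_workflows():
      details = gi.workflows.show_workflow(wf["id"])
      print(details["name"])
      for step in details["steps"].values():
          if step.get("tool_id"):
              print("  ", step["tool_id"], step.get("tool_version"))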

I started writing unit tests in galaxytools (https://github.com/kxk302/galaxytools/tree/nn_tests) so that these workflows are run as part of the unit tests. They would serve as regression tests and ensure that future changes do not break existing code. However, I ran into another issue: models saved to a file cannot be loaded back and error out. I am not sure if this is related to the workflow error above. Here is the error message:

unzip cnn.zip
Archive: cnn.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of cnn.zip or
cnn.zip.zip, and cannot find cnn.zip.ZIP, period.
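
For the unit tests, a cheap pre-check along these lines (plain standard library, nothing Galaxy-ML-specific) would at least distinguish a truncated or mis-typed download from a genuine loader bug:

  # Check that the saved model artifact really is a readable zip archive
  # before handing it to the model loader. "cnn.zip" matches the file above.
  import zipfile

  path = "cnn.zip"
  if not zipfile.is_zipfile(path):
      raise ValueError(f"{path} is not a valid zip archive (truncated or wrong datatype?)")

  with zipfile.ZipFile(path) as archive:
      print("members:", archive.namelist())
      print("first corrupt member:", archive.testzip())  # None if all OK
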
@kxk302 (Contributor, Author) commented May 13, 2021

@anuprulez I just ran my third workflow (CNN workflow) on galaxy.eu and it failed. Could you please check the log to see what error message we get? Thanks.

@kxk302 (Contributor, Author) commented May 13, 2021

I only see "Failed to communicate with remote job server."

@anuprulez (Contributor) commented:

@kxk302 In the first and third histories, I don't have permission to see the datasets. Can you unlock them?

@kxk302 (Contributor, Author) commented May 13, 2021

Update: I re-ran the third history after the initial failure and it completed successfully.

@anuprulez how do I unlock the datasets? I don't see an option when trying to share the history. If you want, we can use Gitter to resolve this. Thanks.

@anuprulez (Contributor) commented:

I see that some changes were made to https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml very recently.

@kxk302 (Contributor, Author) commented May 13, 2021

https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml

Yes, there was a bug fix in Galaxy-ML that was pushed recently.

@kxk302 (Contributor, Author) commented May 13, 2021

You need to rename the uploaded files and change their datatype to tabular before running the workflows. Thanks.
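
If it helps, the renaming and datatype step can also be done at upload time so the histories are reproducible; a rough BioBlend sketch (the history name, file names, and API key are placeholders):

  # Upload an input with the expected name and the tabular datatype already
  # set, instead of renaming and re-typing it by hand in the UI.
  from bioblend.galaxy import GalaxyInstance

  gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")
  history = gi.histories.create_history(name="nn-tutorial-inputs")

  gi.tools.upload_file(
      "X_train.tsv",
      history["id"],
      file_name="X_train",   # name the workflow expects
      file_type="tabular",   # bypass datatype auto-detection
  )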

@kxk302 (Contributor, Author) commented May 13, 2021

@anuprulez did you downgrade the tool versions in the RNN workflow?

@anuprulez (Contributor) commented May 13, 2021 via email

@kxk302 (Contributor, Author) commented May 13, 2021

If you downgrade the tool versions as I documented, it will work.

@kxk302 (Contributor, Author) commented May 13, 2021

I guess the question is why it stopped working with the new versions of those tools.

@qiagu (Contributor) commented May 13, 2021

Try checking the versions of the various packages in the conda environment, as well as the Python version (make sure it is Python 3.6). The conda environment includes many packages, and errors are easy to introduce when a newer package is added to the mix.
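
For example, something along these lines run with that environment's Python would surface the relevant versions (the package list is just a guess at the likely suspects):

  # Print the Python version and the versions of the packages most likely to
  # cause trouble in the tool's conda environment.
  import sys

  print("python:", sys.version)
  for name in ("tensorflow", "keras", "sklearn", "galaxy_ml", "numpy", "pandas"):
      try:
          module = __import__(name)
          print(name, getattr(module, "__version__", "unknown"))
      except ImportError:
          print(name, "not installed")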

@kxk302 (Contributor, Author) commented May 13, 2021

Thanks @qiagu,

Could you please provide more info on how to do that?

@qiagu (Contributor) commented May 13, 2021

Sorry, I was just describing a general debugging process, not anything specific to the issues mentioned in this thread. From the stderr report @anuprulez provided, I feel the errors could be cleared up by re-cleaning the input TSVs.

@qiagu (Contributor) commented May 13, 2021

Try to ensure the classification targets are integers, not floats.
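
A quick way to check and, if needed, coerce the labels (a generic pandas sketch; the file name and the assumption that the last column holds the label are placeholders):

  # Ensure the label column contains integer class labels, not floats.
  import pandas as pd

  targets = pd.read_csv("y_train.tsv", sep="\t")
  label_col = targets.columns[-1]            # assume last column is the label

  print("label dtype:", targets[label_col].dtype)
  if targets[label_col].dtype.kind == "f":   # float labels confuse classifiers
      targets[label_col] = targets[label_col].round().astype(int)
      targets.to_csv("y_train_clean.tsv", sep="\t", index=False)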

@kxk302 (Contributor, Author) commented May 13, 2021

I do not see the errors that Anup sees. I guess the first step would be to get these workflows working with older versions of the tools. Then we can use the new versions to reproduce the problem. @anuprulez I am not sure what your internet connectivity is like, but we could possibly have a Zoom meeting to discuss this tomorrow (Friday). I'm free from 8:00 am to 10:00 am EST.

@mvdbeek (Collaborator) commented May 13, 2021

I only see "Failed to communicate with remote job server."

That's a job-running error, not a tool error; you'll want to check it with Nate.

@kxk302 (Contributor, Author) commented May 13, 2021

I only see "Failed to communicate with remote job server."

That's a job-running error, not a tool error; you'll want to check it with Nate.

This was run on the EU instance. I vaguely remember Bjorn saying that some jobs are configured to run on GPU, that this error shows up in that case, and that it goes away when the job is run on CPU. Am I right, @bgruening?
