Unable to start finetuning #724
Comments
Does your node have internet access? Can you share your output log? There must be an error log somewhere. |
Hi, I see where the problem may be now as the computing nodes I use do not have internet access. Is internet access required even if I were to download the models/checkpoint from https://github.com/ACEsuit/mace-mp/releases/tag/mace_mp_0b to finetune as well? Other than the log file I previously attached, the job output had the following:
|
You can download the model from github and put it in |
I had previously tried downloading the model and specifying the path to it in my training input like this:
Matplotlib created a temporary cache directory at /dev/shm/cchong_6174208/matplotlib-39h7ig7_ because the default path (/home/ec225/ec225/cchong/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
mace_run_train 8
run_train.py 63 main
run_train.py 261 run
multihead_tools.py 183 assemble_mp_data
RuntimeError: |
sorry it is not just the model you need to download https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz and https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy to your |
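A minimal sketch of staging these two files on a machine that does have internet access, so they can be copied to the compute nodes afterwards. The URLs are the ones quoted in the comment above; the destination directory is left as a parameter because the exact folder MACE reads from is not stated here, and `planned_downloads` and `fetch_missing` are hypothetical helper names, not part of MACE.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URLs quoted in the comment above.
MP_0B_FILES = [
    "https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz",
    "https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy",
]


def planned_downloads(dest_dir):
    """Return (url, target_path) pairs, keeping each file's original name."""
    dest = Path(dest_dir).expanduser()
    return [(url, dest / url.rsplit("/", 1)[-1]) for url in MP_0B_FILES]


def fetch_missing(dest_dir, download=urlretrieve):
    """Download only the files that are not already present in dest_dir.

    `download` is any callable taking (url, filename); it defaults to
    urllib's urlretrieve and exists so the logic can be exercised offline.
    """
    fetched = []
    for url, target in planned_downloads(dest_dir):
        if target.exists():
            continue  # already staged, nothing to do
        target.parent.mkdir(parents=True, exist_ok=True)
        download(url, str(target))
        fetched.append(target)
    return fetched
```

Calling `fetch_missing` with a directory of your choice on a machine with internet access, then copying that directory to the location MACE expects, avoids any download attempt on the offline nodes.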
It might be nice to have a |
or more something like --cache-files that only downloads |
or how about a --dry-run option which does various tasks up to but not including training? we could have --dry-run-levels, the first just checks argument validity, the second downloads (and saves) files if needed, the third one evaluates the loss once without updating any weights. |
I was wondering if there is a way to specify the path of the ~/cache/mace directory or to redefine it elsewhere? |
I see. There is currently no way to do that from within mace; maybe there is a way to hack it from your env variables, like a path link. We should add an option to provide a path for these files. |
Got it, I will play around with that, thank you! |
I have not tried this, but setting XDG_CACHE_HOME in your script to point to a .cache folder on /work may help. |
ok to answer myself, will not work since path is hardcoded...
you can edit the three lines to your path or maybe have something like
@ilyes319 if you are happy I can pr this change. |
sure, happy to merge. Will it change anything to the default? |
Hi all, thank you for the help and suggestions. I tried changing the cache_dir lines in both the multihead_tools.py and the foundation_models.py files to point towards a directory in my working directory:
From there, I had downloaded the following files:
Unfortunately, I am still getting the error when I try to start the finetuning:
mace_run_train 8
run_train.py 63 main
run_train.py 261 run
multihead_tools.py 183 assemble_mp_data
RuntimeError:
Sorry for being a pain, but can you tell me where I might be going wrong with this please? |
did you reinstall the changed version? |
@cecilia-hong can you try #755? You can install it with python3 -m pip install -U git+https://github.com/alinelena/mace@custom_cache, but I suggest testing it in a clean environment. All you need to do is |
Hi Alin, sorry for taking so long to get back to you. I made a new Python virtual environment to install the changed version of MACE, but I have been having trouble getting the right modules into the env, so I haven't been able to test it yet. I will definitely let you know once I get it working! |
Hello, can confirm the version you sent is working now, thank you! |
allow custom cache based on XDG_CACHE_HOME env variable, addresses #724
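A minimal sketch of what an XDG_CACHE_HOME-based lookup could look like. This is an illustration under assumptions, not the code from #755: the `mace` subdirectory name and the fallback to `~/.cache` are assumptions drawn from the discussion above, and `resolve_cache_dir` is a hypothetical name.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return $XDG_CACHE_HOME/mace if the variable is set, else ~/.cache/mace.

    `env` defaults to os.environ; passing a plain dict makes the lookup
    easy to check without touching the real environment.
    """
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the usual per-user cache.
    base = env.get("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return Path(base) / "mace"
```

With a lookup like this, exporting XDG_CACHE_HOME in the job script before calling mace_run_train would redirect the cache without editing any source files.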
Hello,
I hope you are all doing well. Thank you for your help with my previous issue. Unfortunately I have run into another one, this time when trying to fine-tune from a foundation model.
So for my first try, I followed the instructions here: https://mace-docs.readthedocs.io/en/latest/guide/finetuning.html
and my training input is as follows:
mace_run_train \
  --name="MACE" \
  --foundation_model="small" \
  --multiheads_finetuning=False \
  --train_file="train.xyz" \
  --valid_fraction=0.05 \
  --test_file="test.xyz" \
  --energy_weight=1.0 \
  --forces_weight=1.0 \
  --E0s="average" \
  --energy_weight=100 \
  --forces_weight=1 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=2 \
  --max_num_epochs=6 \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --default_dtype="float64" \
  --device=cuda \
  --seed=3
However, my training will not start and I cannot seem to understand the source of the error from the output files. I have attached my log file to this, can you please help me with this?
MACE_run-3_debug.log
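A side note on the command above: `--energy_weight` and `--forces_weight` each appear twice. Assuming mace_run_train parses its flags with Python's argparse and the default store behaviour (an assumption, not verified here), the later value silently wins. The toy parser below is a stand-alone illustration of that argparse behaviour, not MACE's own parser.

```python
import argparse

# Toy parser with two float options; not the MACE argument parser.
parser = argparse.ArgumentParser()
parser.add_argument("--energy_weight", type=float)
parser.add_argument("--forces_weight", type=float)

# Same order of repeated flags as in the command above.
args = parser.parse_args([
    "--energy_weight=1.0", "--forces_weight=1.0",
    "--energy_weight=100", "--forces_weight=1",
])

# With the default "store" action, the last occurrence overrides earlier ones.
print(args.energy_weight, args.forces_weight)  # prints: 100.0 1.0
```

This is unrelated to the cache error discussed in the comments, but dropping the first pair of flags would make the intended weights unambiguous.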