
When running, the error: CUDA error out of memory #17

Open
simba0626 opened this issue Oct 11, 2019 · 7 comments

@simba0626

Hi, sorry to trouble you again.
When I run:

```shell
CUDA_VISIBLE_DEVICES=1 python corechain.py -model slotptr -device cuda -dataset lcquad -pointwise True
```

I get `RuntimeError: CUDA error: out of memory` at the line `loss.backward()`.
My GPU has 10 GB of memory.

Thank you for your help.
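An OOM at `loss.backward()` usually means the activations saved for backpropagation do not fit in memory alongside the model. One common workaround (a general PyTorch pattern, not something from this repo) is gradient accumulation: split each batch into micro-batches so only one micro-batch's activations are alive at a time. A minimal sketch with a hypothetical toy model standing in for the slot pointer network:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; the repo's slotptr model is assumed to be
# too large to hold a full batch's activations in 10 GB of GPU memory.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

data = torch.randn(64, 8)
targets = torch.randn(64, 1)

accum_steps = 4            # split one logical batch into 4 micro-batches
micro = 64 // accum_steps  # micro-batch size

optimizer.zero_grad()
for i in range(accum_steps):
    x = data[i * micro:(i + 1) * micro]
    y = targets[i * micro:(i + 1) * micro]
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients match the full batch
    loss.backward()                            # activations are freed per micro-batch
optimizer.step()
```

Scaling each micro-batch loss by `accum_steps` keeps the accumulated gradient equal to the full-batch gradient, so the optimizer step is mathematically unchanged.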

@simba0626
Author

Sorry to trouble you again.
After I set the pretrained embedding's `requires_grad = False`, it runs OK. In detail:

```python
if vectors is not None:
    self.embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
    self.embedding_layer.weight.requires_grad = False  # was True
```

This means the embeddings are non-trainable.
Does this setup affect reproducing the experimental results?

Thank you.
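As a side note on the snippet above: `nn.Embedding.from_pretrained` has a `freeze` argument that controls the same thing, and it defaults to `True`, so setting `requires_grad = False` afterwards matches the default behavior. A small self-contained check (the `vectors` tensor here is a placeholder):

```python
import torch
import torch.nn as nn

vectors = torch.randn(100, 300)  # placeholder for the pretrained word vectors

# freeze=True (the default) is equivalent to setting requires_grad = False
# on the weight afterwards; pass freeze=False to keep embeddings trainable.
frozen = nn.Embedding.from_pretrained(vectors, freeze=True)
trainable = nn.Embedding.from_pretrained(vectors, freeze=False)

assert frozen.weight.requires_grad is False
assert trainable.weight.requires_grad is True
```

Keeping the embeddings frozen reduces both the gradient memory and the optimizer state, which is why it helped with the OOM.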

@saist1993
Member

It would affect the results. Can you tell me the batch size and other related hyperparameters? Also, can you run it with pointwise False?

@simba0626
Author

Hi, batch size = 4000, epochs = 100; the other related hyperparameters are the same as in the source. Specifically:

```ini
[lcquad]
_neg_paths_per_epoch_train = 100
_neg_paths_per_epoch_validation = 1000
total_negative_samples = 1000
batch_size = 4000
hidden_size = 256
number_of_layer = 1
embedding_dim = 300
vocab_size = 15000
dropout = 0.5
dropout_rec = 0.3
dropout_in = 0.3
output_dim = 300
rel_pad = 25
relsp_pad = 12
relrd_pad = 2
```

I run the command line:

```shell
CUDA_VISIBLE_DEVICES=1 python corechain.py -model slotptr -device cuda -dataset lcquad -pointwise False
```

@simba0626
Author

Sorry to trouble you again. The result is: BestValiAcc: 0.654, BestTestAcc: 0.664.
In addition, when evaluating, I get `RuntimeError: CUDA error: out of memory`.

Would you help me solve it?
Thank you.

@saist1993
Member

I think this is happening because the file tries to load another slot pointer instance while one slot pointer instance is already in memory. This will not affect the final result much, because the best-performing model (the one with the highest validation accuracy) is stored on disk. I have highlighted the best accuracy result in the image.

You can run onefile.py with the appropriate params to load the model and re-run the eval. I would also recommend training for more epochs, as it looks like the model has not converged.

[image: training log with the best validation accuracy highlighted]
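The "two instances in memory" problem described above can be avoided in general by freeing the training model before loading the saved one, and by evaluating under `torch.no_grad()` so no autograd buffers are allocated. A sketch with a hypothetical toy model (none of these names are from the repo; an in-memory buffer stands in for the checkpoint file):

```python
import io
import torch
import torch.nn as nn

# Hypothetical stand-in for the slot pointer model; names are illustrative.
train_model = nn.Linear(8, 2)
buffer = io.BytesIO()
torch.save(train_model.state_dict(), buffer)

# Free the training instance before loading a second one, so two model
# copies never coexist in GPU memory at evaluation time.
del train_model
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached blocks to the allocator

buffer.seek(0)
eval_model = nn.Linear(8, 2)
eval_model.load_state_dict(torch.load(buffer))
eval_model.eval()

with torch.no_grad():  # no autograd graph -> much smaller memory footprint
    out = eval_model(torch.randn(4, 8))
```

`torch.no_grad()` alone often cuts evaluation memory substantially, since no intermediate activations are retained for backpropagation.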

@simba0626
Author

OK, I'll give it a try. But I have a question: how many epochs should be set?

Thanks.

@simba0626
Author

I found 300 epochs in the paper. I will try that.
