Commit

Training succeeds and the model appears to be saved properly, but loading the model fails with an error.

Example wandb link: https://wandb.ai/understanding-sam/levanter/runs/pdi0vc3w?nw=nwuserwhen
WhenWen committed Dec 12, 2024
1 parent 2f119bd commit 271479f
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions error_loading_model.sh
@@ -0,0 +1,10 @@
eval $(ssh-agent -s)
bash infra/babysit-tpu-vm.sh muon-debug -z us-central2-b -t v4-128 --preemptible -- \
WANDB_API_KEY=[WANDB_API_KEY] \
bash levanter/infra/run.sh python \
levanter/src/levanter/main/train_lm.py \
--config_path levanter/config/llama2_100M_muon.yaml \
--trainer.checkpointer.base_path gs://marin-us-central2/scratch/kaiyue/checkpoints/muon/llama2_100M_constant \
--optimizer.type muon \
--trainer.num_train_steps 10000 \
--trainer.load_checkpoint_path gs://marin-us-central2/scratch/kaiyue/checkpoints/muon/llama2_100M_constant/tjo9vxfb/step-4000
