Training succeeds and the model appears to be saved properly, but loading the model fails with an error. Example wandb link: https://wandb.ai/understanding-sam/levanter/runs/pdi0vc3w?nw=nwuserwhen
stanford-crfm · Dec 12, 2024 · 271479f
1 parent 2f119bd
commit 271479f
Showing 1 changed file with 10 additions and 0 deletions.
diff --git a/error_loading_model.sh b/error_loading_model.sh
@@ -0,0 +1,10 @@
+eval $(ssh-agent -s)
+bash infra/babysit-tpu-vm.sh muon-debug -z us-central2-b -t v4-128 --preemptible -- \
+WANDB_API_KEY=[WANDB_API_KEY] \
+bash levanter/infra/run.sh python \
+levanter/src/levanter/main/train_lm.py \
+--config_path levanter/config/llama2_100M_muon.yaml  \
+--trainer.checkpointer.base_path  gs://marin-us-central2/scratch/kaiyue/checkpoints/muon/llama2_100M_constant  \
+--optimizer.type muon \
+--trainer.num_train_steps 10000 \
+--trainer.load_checkpoint_path  gs://marin-us-central2/scratch/kaiyue/checkpoints/muon/llama2_100M_constant/tjo9vxfb/step-4000