update readme
msaroufim committed Dec 3, 2024
1 parent b5e0ebf · commit 1dad36d
Showing 1 changed file with 1 addition and 5 deletions.
README.md: 1 addition & 5 deletions
@@ -60,7 +60,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 6. DDP and HSDP
 7. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
 8. Learning rate scheduler, meta-init, (optional) fused RMSNorm kernel
-9. Loss, GPU memory, throughput (tokens/sec), and MFU displayed and logged via [Tensorboard or Weights & Biases](#logging)
+9. Loss, GPU memory, throughput (tokens/sec), and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
 10. Debugging tools including CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [Flight Recorder](#debugging), etc.
 11. All options easily configured via [toml files](train_configs/)
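Item 9 above now points the logging docs at /docs/metrics.md. As a rough illustration of the toggles involved, here is a hypothetical toml snippet; the `[metrics]` table name is an assumption, and only the `enable_tb` and `enable_wandb` key names come from the Logging section this commit removes (shown in the next hunk):

```toml
# Hypothetical sketch of the logging toggles in a train_configs/*.toml file.
# The [metrics] table name is assumed; enable_tb / enable_wandb are the flag
# names mentioned in the removed Logging section below.
[metrics]
enable_tb = true       # emit TensorBoard event files
enable_wandb = false   # flip to true to log to Weights & Biases
```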

@@ -99,10 +99,6 @@ Llama 3 8B model locally on 8 GPUs
 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
 ```

-## Logging
-
-We support logging via both TensorBoard and Weights and Biases; all you need to do is enable it in your `toml` file or CLI using `enable_tb` or `enable_wandb` respectively. You can learn more [here](docs/metrics.md).
-
 ## Multi-Node Training
 For training on ParallelCluster/Slurm type configurations, you can use the `multinode_trainer.slurm` file to submit your sbatch job.

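The Multi-Node Training context retained above amounts to a single Slurm submission. A minimal sketch, assuming a standard Slurm installation and that `multinode_trainer.slurm` already encodes the desired resource requests:

```bash
# Submit the multi-node training job; resource requests (nodes, GPUs, partition)
# are assumed to live inside multinode_trainer.slurm itself.
sbatch multinode_trainer.slurm

# Standard Slurm tooling to watch the queued job.
squeue -u "$USER"
```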