From 1b960c40c0f6d7b46b5abc171360a7216eb3e12d Mon Sep 17 00:00:00 2001
From: Tianyu Liu
Date: Tue, 27 Feb 2024 12:47:02 -0800
Subject: [PATCH 1/2] improve TensorBoard instructions in README

[ghstack-poisoned]
---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 81d78720..64fc10c2 100644
--- a/README.md
+++ b/README.md
@@ -32,21 +32,21 @@ run the llama debug model locally to verify the setup is correct:
 
 # TensorBoard
 
-To visualize training metrics on TensorBoard:
+To visualize TensorBoard metrics of models trained on a remote server via a local web browser:
 
-1. (by default) set `enable_tensorboard = true` in `torchtrain/train_configs/train_config.toml`
+1. Make sure `metrics.enable_tensorboard` option is set to true in model training (either from a .toml file or from CLI).
 
-2. set up SSH tunneling
+2. Set up SSH tunneling, by running the following from local CLI
 ```
 ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
 ```
 
-3. then in the torchtrain repo
+3. On the remote server, in the torchtrain repo, start the TensorBoard backend
 ```
 tensorboard --logdir=./torchtrain/outputs/tb
 ```
 
-4. go to the URL it provides OR to http://localhost:6006/
+4. In the local web browser, go to the URL it provides OR to http://localhost:6006/.
 
 ## Multi-Node Training
 For training on ParallelCluster/Slurm type configurations, you can use the multinode_trainer.slurm file to submit your sbatch job.
From 8f356fcdd8936df6c9f7a6645fd1de2e165cf376 Mon Sep 17 00:00:00 2001
From: Tianyu Liu
Date: Tue, 27 Feb 2024 13:43:09 -0800
Subject: [PATCH 2/2] Update on "improve TensorBoard instructions in README"

Based on the feedback received, improve the TB section in README to reduce ambiguity. Also updated some outdated info.

[ghstack-poisoned]
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 64fc10c2..62bcdc57 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ To visualize TensorBoard metrics of models trained on a remote server via a loca
 ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
 ```
 
-3. On the remote server, in the torchtrain repo, start the TensorBoard backend
+3. Inside the SSH tunnel that logged into the remote server, go to the torchtrain repo, and start the TensorBoard backend
 ```
 tensorboard --logdir=./torchtrain/outputs/tb
 ```
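
Note for anyone following the README steps in this series: the sketch below is one possible way to sanity-check step 2 from the local machine before opening the browser. It is not part of torchtrain. The helper names `build_tunnel_command` and `port_is_open` are hypothetical, and the username and hostname are placeholders. A successful check only shows that something is listening on local port 6006 (for example, the SSH tunnel); it does not confirm that TensorBoard is serving on the remote side.

```python
import socket


def build_tunnel_command(username: str, hostname: str, port: int = 6006) -> list[str]:
    """Return the argv for the port-forwarding command from step 2.

    Hypothetical helper: it only assembles the same `ssh -L` command
    shown in the README; it does not run it.
    """
    return ["ssh", "-L", f"{port}:127.0.0.1:{port}", f"{username}@{hostname}"]


def port_is_open(host: str = "127.0.0.1", port: int = 6006, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Placeholder credentials; substitute your own [username] and [hostname].
    print(" ".join(build_tunnel_command("username", "hostname")))
    if port_is_open():
        print("Local port 6006 is accepting connections; try http://localhost:6006/")
    else:
        print("Nothing is listening on local port 6006; is the SSH tunnel running?")
```

If the check reports that the port is closed, the tunnel from step 2 is the first thing to revisit; if it is open but the page does not load, the TensorBoard process from step 3 is the more likely cause.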