Update on "[MoE][PoC] Expert Parallel: tp and tp2ep"
Issues (12/11/2024)
- forward collectives look right ("tp2ep": all-gather -> expert compute -> reduce-scatter; see the sketch after this list), but the backward pass still needs to be understood better
- torch.compile generates a full graph (applied per TransformerBlock), but inserts an additional all-to-all (A2A) at the end of every two blocks
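
For reference, a minimal sketch of the "tp2ep" forward pattern from the first bullet, not this PR's actual implementation; it assumes an initialized `torch.distributed` process group and an illustrative `local_expert` module, and that the expert zeroes out tokens not routed to it so the reduction sums per-expert partials correctly:

```python
import torch
import torch.distributed as dist


def tp2ep_forward(x: torch.Tensor, local_expert: torch.nn.Module,
                  group: dist.ProcessGroup) -> torch.Tensor:
    world_size = dist.get_world_size(group)
    # AG: gather token shards from all ranks in the group, so every rank
    # sees the full set of tokens and can run its local expert(s) on them.
    gathered = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(gathered, x, group=group)
    tokens = torch.cat(gathered, dim=0)
    # compute: this rank's expert produces partial outputs; tokens not
    # routed to this expert are assumed to contribute zeros (illustrative
    # assumption), so the upcoming reduction sums the partials correctly.
    out = local_expert(tokens)
    # RS: reduce-scatter sums the partial outputs across ranks and hands
    # each rank back only its original shard of tokens.
    shard = torch.empty_like(x)
    dist.reduce_scatter(shard, list(out.chunk(world_size, dim=0)), group=group)
    return shard
```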

Haven't worked on
- softmax scoring when Router Parallel is used (currently only sigmoid scoring is supported; a sketch of the two variants follows)
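
A sketch of the two scoring variants, with hypothetical names and shapes rather than this PR's code, to show why softmax is harder under Router Parallel:

```python
import torch


def router_scores(logits: torch.Tensor, top_k: int, use_sigmoid: bool = True):
    # logits: (num_tokens, num_experts), produced by the router gate.
    if use_sigmoid:
        # sigmoid scores each expert independently, so it works unchanged
        # when the expert dimension is sharded across ranks.
        scores = torch.sigmoid(logits)
    else:
        # softmax normalizes across the full expert dimension; under Router
        # Parallel each rank only holds a slice of the expert logits, so a
        # cross-rank normalization would be needed (the unimplemented part).
        scores = torch.softmax(logits, dim=-1)
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    return top_scores, top_idx
```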

[ghstack-poisoned]
tianyu-l committed Dec 12, 2024
1 parent 50faa5a commit fa01d7c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion — train_configs/debug_model.toml

```diff
@@ -38,7 +38,7 @@ max_norm = 1.0 # grad norm clipping
 steps = 10
 data_parallel_replicate_degree = 1
 data_parallel_shard_degree = -1
-tensor_parallel_degree = 4
+tensor_parallel_degree = 1
 compile = false
 dataset = "c4_test" # supported datasets: c4_test (2K), c4 (177M)
```
