simplify embedding + first transformer block TP (#314)
As titled, we can directly specify the rowwise parallel embedding's output layouts to be sharded on the sequence dim, so we don't need the separate input preparation for the first transformer block. Switching to output_layouts = Shard(1) also triggers a reduce_scatter instead of an allreduce for the embedding layer, which could give some small perf wins.
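
For reference, a minimal sketch of what such a parallelize plan could look like with this change, using the public DTensor tensor-parallel APIs (`parallelize_module`, `RowwiseParallel`, `Shard`). The `TinyModel`, mesh size, and module name `tok_embeddings` here are illustrative assumptions, not the actual torchtitan code.

```python
import torch.nn as nn
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module


class TinyModel(nn.Module):
    # Hypothetical stand-in for a Llama-style model's embedding layer.
    def __init__(self, vocab_size: int = 32000, dim: int = 4096):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)


# Assumed 8-way tensor-parallel device mesh.
tp_mesh = init_device_mesh("cuda", (8,))
model = TinyModel()

# Shard the embedding output on the sequence dim (dim 1) directly, so the
# rowwise-parallel embedding finishes with a reduce_scatter rather than an
# allreduce, and no separate prepare-input step is needed before the first
# transformer block.
parallelize_module(
    model,
    tp_mesh,
    {
        "tok_embeddings": RowwiseParallel(
            input_layouts=Replicate(),  # token ids replicated across TP ranks
            output_layouts=Shard(1),    # activations sharded on the sequence dim
        ),
    },
)
```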