Coord refactor #186
base: sycl-develop

Conversation
Nice work @t4c1 - a few small things I spotted.
auto
get_pvc_tensor(GShape const& g_shape) const {
  static_assert(rank(GShape{}) == 3, "mismatch rank");
  return make_counting_tensor(make_layout(g_shape, make_stride(E<0>(), E<1>(), E<2>())));
get_tma_tensor uses g_stride_ for the 2nd arg to make_layout here. Is there any loss of generality with this simpler approach?
constexpr int dtype_size = sizeof(dtype);
constexpr int bits_in_byte = 8;
Cutlass provides cutlass::sizeof_bits<dtype> for this.
static_assert(is_rmem<TS>::value);
static_assert(size(SLayout{}) * dtype_size * bits_in_byte == size<1>(typename Traits_ST_t::SrcLayout{}),
              "Src tensor size does not match copy atom size");
static_assert(size(DLayout{}) * dtype_size * bits_in_byte == size<1>(typename Traits_ST_t::DstLayout{}),
As above, use cutlass::sizeof_bits<dtype>, I think.
@@ -137,12 +137,31 @@ struct CollectiveMma<
using traits_load_B = Copy_Traits<GmemTiledCopyB, StrideB>;
using atom_load_B = Copy_Atom<traits_load_B, ElementB>;
I think the changes from this file need to be copied over to xe_mma_mixed_input.hpp. I am getting local failures of ninja test_unit_gemm_device.
using TensorMKL = decltype(make_tensor(make_gmem_ptr(static_cast<ElementA const*>(nullptr)), make_shape(0,0,0), StrideA{})); // (m, k)
using TensorNKL = decltype(make_tensor(make_gmem_ptr(static_cast<ElementB const*>(nullptr)), make_shape(0,0,0), StrideB{})); // (n, k)
Unused
// Instantiate the MMA object and get thread slice
TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_slice(thread_idx);
// To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
Can we have a TODO(Codeplay): here to fix this later?
Tensor tArA = thr_copy_A2.retile_D(tCrA);
Tensor tBrB = thr_copy_B2.retile_D(tCrB);

// Retile global tile for copies
Tensor tAgA = thr_copy_A2.retile_S(tCgA);
Tensor tBgB = thr_copy_B2.retile_S(tCgB);
retile_D and retile_S do the same thing, by the way. Not sure if that affects what's going on here, but I don't think I've seen both used anywhere before.
Tensor g_cta_D_mnl = local_tile(mD_mnl, CtaTileMNK{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this CTA is responsible for
Tensor g_cta_D = g_cta_D_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N)
I am wondering whether it should be possible to avoid this and have something like:
Tensor g_cta_D_mnl = local_tile(mD_mnl, CtaTileMNK{}, make_coord(m_coord,n_coord,l_coord), Step<_1,_1, X>{});
Tensor gD_mnl = local_tile(g_cta_D, SubgroupTileShape{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this warp is responsible for
Tensor gD = gD_mnl(_,_,m_sg,n_sg); // (BLK_M,BLK_N)
Same here
// Instantiate the MMA object and get thread slice
TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_slice(thread_idx);
// To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
Suggested change:
- // To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
+ // To make all work items in a subgroup have the same global tensors pass in the index of work item 0 in each subgroup
using SrcLayout = Layout<Shape <_16, Shape <_16,   _2,  _32>>,
                         Stride< _0, Stride< _1, _256, _512>>>;
The formatting is a bit messy here.
@@ -310,12 +317,27 @@ class CollectiveEpilogue<
auto sg_coord = make_coord(sg_m_coord, sg_n_coord, k_coord, l_coord);

bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed();

// Represent the full output tensor
Tensor mD_mnl = params.xe_store_d.get_pvc_tensor(make_shape(M,N,L));
This should be a counting tensor of D, I believe, so maybe cD would be more appropriate? (I'm not really sure.)
@@ -238,6 +244,15 @@ struct XE_2D_LD_Unpack {
make_layout(t_shape, t_stride));
}

// Generate the PVC coord tensor
template <class GShape>
This seems unrelated to the class it's in. Maybe it shouldn't be part of the copy traits?
// Tile the output tensor per CTA
Tensor g_cta_D_mnl = local_tile(mD_mnl, CtaTileMNK{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this CTA is responsible for
Tensor g_cta_D = g_cta_D_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N)
Suggested change:
- // Tile the output tensor per CTA
- Tensor g_cta_D_mnl = local_tile(mD_mnl, CtaTileMNK{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)
- // Slice to get the tile this CTA is responsible for
- Tensor g_cta_D = g_cta_D_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N)
+ // Tile the output tensor per CTA
+ Tensor g_cta_D = local_tile(mD_mnl, take<0,2>(CtaTileMNK{}), make_coord(m_coord,n_coord,l_coord)); // (BLK_M,BLK_N)

I think this is simpler. Maybe it should be cta_cD?
// Tile the output tensor per warp
Tensor gD_mnl = local_tile(g_cta_D, SubgroupTileShape{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this warp is responsible for
Tensor gD = gD_mnl(_,_,m_sg,n_sg); // (BLK_M,BLK_N)
Suggested change:
- // Tile the output tensor per warp
- Tensor gD_mnl = local_tile(g_cta_D, SubgroupTileShape{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)
- // Slice to get the tile this warp is responsible for
- Tensor gD = gD_mnl(_,_,m_sg,n_sg); // (BLK_M,BLK_N)
+ // Tile the output tensor per warp
+ Tensor gD = local_tile(g_cta_D, SubgroupTileShape{}, make_coord(m_sg,n_sg)); // (SG_M, SG_N)

I think this is correct too.
auto gA_mk = local_tile(mA_mk, blk_shape, make_coord(_,_,_), Step<_1, X, _1>{});
auto gB_nk = local_tile(mB_nk, blk_shape, make_coord(_,_,_), Step< X, _1, _1>{});

// Slice with m_coord and n_coord
Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k)
Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k)
Suggested change:
- auto gA_mk = local_tile(mA_mk, blk_shape, make_coord(_,_,_), Step<_1, X, _1>{});
- auto gB_nk = local_tile(mB_nk, blk_shape, make_coord(_,_,_), Step< X, _1, _1>{});
- // Slice with m_coord and n_coord
- Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k)
- Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k)
+ auto gA = local_tile(mA_mk, blk_shape, make_coord(m_coord,_,_), Step<_1, X, _1>{}); // (BLK_M,BLK_K,k)
+ auto gB = local_tile(mB_nk, blk_shape, make_coord(_,n_coord,_), Step< X, _1, _1>{}); // (BLK_N,BLK_K,k)
Tensor tCrA = make_tensor<ElementA>(tCgA(_,_,_,0).shape());
Tensor tCrB = make_tensor<ElementB>(tCgB(_,_,_,0).shape(), make_stride(_1{}, shape<0>(tCgB) * shape<2>(tCgB), shape<0>(tCgB)));
This line, too, does not seem to match what you are aiming to do.
Tensor mA_mk = mA_mkl(_,_,l_coord); // (m,k)
Tensor mB_nk = mB_nkl(_,_,l_coord); // (n,k)

auto gA_mk = local_tile(mA_mk, blk_shape, make_coord(_,_,_), Step<_1, X, _1>{});
auto gB_nk = local_tile(mB_nk, blk_shape, make_coord(_,_,_), Step< X, _1, _1>{});

// Slice with m_coord and n_coord
Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k)
Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k)
Same here; I think it should be possible to say:
Tensor gA = local_tile(mA_mkl, blk_shape, make_coord(m_coord,_,l_coord), Step<_1, X, _1>{});
Tensor gB = local_tile(mB_nkl, blk_shape, make_coord(n_coord,_,l_coord), Step< X, _1, _1>{});
Refactor coordinates for PVC copies to be consistent with how copies for all CUDA GPUs are called.