
Optimize Hive hash computation for nested types #2720

Open. Wants to merge 44 commits into base: branch-25.02

Conversation


@ustcfy ustcfy commented Dec 23, 2024

Main Optimization:

Flatten nested columns in advance, reducing the size of the stack_frame.

Possible Further Optimization:

  • Make the various hash functions share the nested-type flattening logic; however, xxhash64 does not need _cur_hash, which makes unification challenging.

Benchmark:

  1. Environment: NVIDIA TITAN RTX
  2. Sizes: 50MB, 100MB, 500MB, 1GB
  • schema: struct with a depth of max_depth, with basic types INT32, FLOAT32, and STRING
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   50000000   |      1      | 609.823 us |      56.37% | 452.706 us |       8.10% |   -157.117 us | -25.76% |   FAST   |
|  100000000   |      1      | 958.429 us |      32.03% | 606.708 us |      66.13% |   -351.722 us | -36.70% |   FAST   |
|  500000000   |      1      |   3.981 ms |      10.85% |   2.091 ms |       1.65% |  -1889.419 us | -47.46% |   FAST   |
|  1000000000  |      1      |   7.716 ms |       5.89% |   3.929 ms |       0.50% |  -3786.579 us | -49.08% |   FAST   |
|   50000000   |      2      |   1.493 ms |       1.52% | 555.161 us |       5.56% |   -937.916 us | -62.82% |   FAST   |
|  100000000   |      2      |   2.760 ms |       0.61% | 892.648 us |       3.01% |  -1867.525 us | -67.66% |   FAST   |
|  500000000   |      2      |  13.026 ms |       0.39% |   3.539 ms |      16.46% |  -9487.762 us | -72.83% |   FAST   |
|  1000000000  |      2      |  25.811 ms |       0.28% |   6.790 ms |       0.59% | -19021.039 us | -73.69% |   FAST   |
|   50000000   |      4      |   2.530 ms |       7.86% |   1.094 ms |       0.66% |  -1436.823 us | -56.78% |   FAST   |
|  100000000   |      4      |   4.850 ms |       5.51% |   1.946 ms |       1.25% |  -2903.567 us | -59.87% |   FAST   |
|  500000000   |      4      |  23.405 ms |       0.13% |   8.761 ms |       0.47% | -14643.401 us | -62.57% |   FAST   |
|  1000000000  |      4      |  46.649 ms |       0.20% |  17.254 ms |       2.64% | -29395.025 us | -63.01% |   FAST   |
|   50000000   |      8      |   4.468 ms |       0.68% |   2.102 ms |       0.53% |  -2366.352 us | -52.96% |   FAST   |
|  100000000   |      8      |   8.731 ms |       0.44% |   3.953 ms |       0.42% |  -4777.431 us | -54.72% |   FAST   |
|  500000000   |      8      |  42.672 ms |       0.08% |  18.808 ms |       0.39% | -23863.937 us | -55.92% |   FAST   |
|  1000000000  |      8      |  85.191 ms |       0.05% |  37.338 ms |       0.30% | -47853.264 us | -56.17% |   FAST   |
  • schema: list with a depth of max_depth, with the basic type being INT32
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   50000000   |      1      | 474.191 us |       5.38% | 467.645 us |      36.75% |      -6.546 us |  -1.38% |   SAME   |
|  100000000   |      1      | 682.398 us |      47.04% | 604.959 us |      65.52% |     -77.439 us | -11.35% |   SAME   |
|  500000000   |      1      |   2.523 ms |      10.46% |   2.630 ms |      11.76% |     106.458 us |   4.22% |   SAME   |
|  1000000000  |      1      |   4.800 ms |       0.20% |   3.370 ms |      10.34% |   -1429.947 us | -29.79% |   FAST   |
|   50000000   |      2      | 890.665 us |       1.14% | 867.720 us |       1.51% |     -22.946 us |  -2.58% |   FAST   |
|  100000000   |      2      |   1.858 ms |       0.55% |   1.832 ms |       1.02% |     -25.131 us |  -1.35% |   FAST   |
|  500000000   |      2      |   9.657 ms |       4.41% |   9.763 ms |       0.44% |     105.982 us |   1.10% |   SLOW   |
|  1000000000  |      2      |  19.158 ms |       2.73% |  19.139 ms |       0.26% |     -19.141 us |  -0.10% |   SAME   |
|   50000000   |      4      |   9.557 ms |       0.41% |   8.215 ms |       0.20% |   -1341.963 us | -14.04% |   FAST   |
|  100000000   |      4      |   9.596 ms |       0.20% |   8.235 ms |       0.21% |   -1360.763 us | -14.18% |   FAST   |
|  500000000   |      4      |  10.212 ms |       0.35% |   8.836 ms |       0.36% |   -1375.577 us | -13.47% |   FAST   |
|  1000000000  |      4      |  18.739 ms |       0.80% |  10.990 ms |       0.25% |   -7748.325 us | -41.35% |   FAST   |
|   50000000   |      8      |   3.643 us |      20.58% |   5.008 us |      12.68% |       1.365 us |  37.48% |   SLOW   |
|  100000000   |      8      |   3.612 us |      19.83% |   4.869 us |      13.84% |       1.257 us |  34.80% |   SLOW   |
|  500000000   |      8      |   19.112 s |        inf% |   18.441 s |        inf% | -671019.531 us |  -3.51% |   SAME   |
|  1000000000  |      8      |   19.252 s |        inf% |   18.581 s |        inf% | -670972.656 us |  -3.49% |   SAME   |
  • schema: list with a depth of max_depth, with the basic type being STRING
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   50000000   |      1      |   1.225 ms |      34.18% |   1.238 ms |      34.57% |    12.566 us |   1.03% |   SAME   |
|  100000000   |      1      |   2.469 ms |       0.32% |   2.474 ms |       0.32% |     5.280 us |   0.21% |   SAME   |
|  500000000   |      1      |  11.670 ms |       0.13% |  11.631 ms |       0.48% |   -39.053 us |  -0.33% |   FAST   |
|  1000000000  |      1      |  23.225 ms |       0.32% |  23.167 ms |       0.24% |   -57.943 us |  -0.25% |   FAST   |
|   50000000   |      2      | 986.634 us |       0.99% | 959.146 us |       1.21% |   -27.489 us |  -2.79% |   FAST   |
|  100000000   |      2      |   4.743 ms |       0.87% |   4.753 ms |       0.84% |     9.306 us |   0.20% |   SAME   |
|  500000000   |      2      |  35.896 ms |       0.59% |  39.882 ms |       0.34% |     3.986 ms |  11.10% |   SLOW   |
|  1000000000  |      2      |  74.205 ms |       0.29% |  82.584 ms |       0.31% |     8.379 ms |  11.29% |   SLOW   |
|   50000000   |      4      |  33.784 ms |       0.11% |  32.820 ms |       0.10% |  -964.448 us |  -2.85% |   FAST   |
|  100000000   |      4      |  34.308 ms |       0.11% |  33.523 ms |       0.22% |  -784.787 us |  -2.29% |   FAST   |
|  500000000   |      4      |  34.769 ms |       0.22% |  34.006 ms |       0.12% |  -763.420 us |  -2.20% |   FAST   |
|  1000000000  |      4      |  35.852 ms |       0.16% |  34.705 ms |       0.11% | -1147.547 us |  -3.20% |   FAST   |
|   50000000   |      8      |   4.155 us |      18.74% |   4.042 us |      18.08% |    -0.113 us |  -2.72% |   SAME   |
|  100000000   |      8      |   3.960 us |      18.35% |   3.965 us |      19.00% |     0.005 us |   0.12% |   SAME   |
|  500000000   |      8      |   4.157 us |      24.39% |   4.064 us |      17.89% |    -0.092 us |  -2.22% |   SAME   |
|  1000000000  |      8      |   52.247 s |        inf% |   52.920 s |        inf% |   673.637 ms |   1.29% |   SAME   |

@ustcfy ustcfy changed the title Optimize Hive hash computation for nested types [Draft] Optimize Hive hash computation for nested types Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy requested a review from ttnghia December 23, 2024 08:42
@thirtiseven thirtiseven left a comment

Could you post the benchmark results here?

Maybe also describe how it failed on the string type, and how to reproduce it, if you need help with that issue.

@ustcfy ustcfy requested a review from res-life December 23, 2024 10:17
@ustcfy ustcfy self-assigned this Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
sperlingxx

This comment was marked as outdated.

@@ -486,15 +526,77 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,

check_nested_depth(input);

// `flattened_column_views` only contains nested columns and columns that result from flattening
// nested columns
std::vector<cudf::column_view> flattened_column_views;
Collaborator

The column_view constructor calculates the null count, which is time-consuming.
The original approach did not need to calculate the null count.
We may need to find a way to avoid this column_view array.
Please check whether contiguous_copy_column_device_views is helpful.

@res-life

Fixed the illegal memory access error reported for the struct(string) type:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index ca720b5..5e9bb35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -22,6 +22,7 @@
 #include <cudf/structs/structs_column_view.hpp>
 #include <cudf/table/experimental/row_operators.cuh>
 #include <cudf/table/table_device_view.cuh>
+#include <cudf/table/table_device_view.cuh>
 
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
@@ -566,17 +567,9 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     }
   }
 
-  std::vector<cudf::column_device_view> device_flattened_column_views;
-  device_flattened_column_views.reserve(flattened_column_views.size());
-
-  std::transform(
-    flattened_column_views.begin(),
-    flattened_column_views.end(),
-    std::back_inserter(device_flattened_column_views),
-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });
+  [[maybe_unused]] auto [device_view_owners, device_flattened_column_views] =
+    cudf::contiguous_copy_column_device_views<cudf::column_device_view>(flattened_column_views, stream);
 
-  auto flattened_column_device_views =
-    cudf::detail::make_device_uvector_async(device_flattened_column_views, stream, mr);
   auto first_child_index_view =
     cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
   auto nested_column_map_view =
@@ -594,7 +587,7 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     output_view.end<hive_hash_value_t>(),
     hive_device_row_hasher<hive_hash_function, bool>(nullable,
                                                      *input_view,
-                                                     flattened_column_device_views.data(),
+                                                     device_flattened_column_views,
                                                      first_child_index_view.data(),
                                                      nested_column_map_view.data()));

Please apply the above patch.

Signed-off-by: Yan Feng <[email protected]>
@ttnghia

ttnghia commented Dec 24, 2024

-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });

Remember never to dereference the output of column_device_view::create or table_device_view::create directly. They return an owning pointer; if we don't keep that pointer alive, it dangles right away and any reference obtained from it is invalidated.

@ustcfy ustcfy changed the title [Draft] Optimize Hive hash computation for nested types Optimize Hive hash computation for nested types Dec 25, 2024
@sperlingxx

sperlingxx commented Dec 25, 2024

I just came up with a random thought:
How about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result.
By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)

Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

@res-life

res-life commented Dec 25, 2024

I just came up with a random thought: how about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)

Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

Good idea for struct/primitive-only types.
But for the list type, e.g., a list(int) column:
row0: [1,2,3]
row1: [1,2]
row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also consider multiply nested lists, like list(list(int)) with data:
[[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3))
[[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea.
We can discuss this further.

@sperlingxx

I just came up with a random thought: how about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)
Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

Good idea for struct/primitive-only types. But for the list type, e.g., a list(int) column: row0: [1,2,3], row1: [1,2]. row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also consider multiply nested lists, like list(list(int)) with data: [[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3)); [[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea. We can discuss this further.

Yes, you are right! For the list type, it seems impossible to get rid of the stack operations.

@sperlingxx sperlingxx left a comment

LGTM in general, except for some NIT improvements.

ustcfy and others added 2 commits December 30, 2024 10:24
ustcfy and others added 2 commits January 4, 2025 00:44
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy marked this pull request as ready for review January 6, 2025 07:08
@res-life

res-life commented Jan 6, 2025

We do not need: class element_hasher_adapter

   private:
    hash_functor_t hash_functor;
    hive_device_row_hasher const& _parent;

Move hash_functor into hive_device_row_hasher, then _parent is not needed.

Signed-off-by: Yan Feng <[email protected]>
@ttnghia

ttnghia commented Jan 7, 2025

We do not need: class element_hasher_adapter

This is a good idea. It can save a register by eliminating the reference variable _parent. By doing so, we just move all the operator() overloads of the _hasher_adapter into the parent struct.

@ttnghia

ttnghia commented Jan 7, 2025

Here is another optimization:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index 7ce8026f22..49b04de86c 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -39,13 +39,6 @@ using hive_hash_value_t = int32_t;
 constexpr hive_hash_value_t HIVE_HASH_FACTOR = 31;
 constexpr hive_hash_value_t HIVE_INIT_HASH   = 0;
 
-struct col_info {
-  cudf::type_id type_id;
-  cudf::size_type
-    nested_num_children_or_basic_col_idx;  // Number of children for nested types, or column index
-                                           // in `basic_cdvs` for basic types
-};
-
 hive_hash_value_t __device__ inline compute_int(int32_t key) { return key; }
 
 hive_hash_value_t __device__ inline compute_long(int64_t key)
@@ -158,6 +151,18 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
   return static_cast<hive_hash_value_t>(result);
 }
 
+/**
+ * @brief The struct storing column's auxiliary information.
+ */
+struct col_info {
+  // Column type id.
+  cudf::type_id type_id;
+
+  // Store the upper bound of the number of elements for lists columns, or the upper bound of
+  // the number of children for structs columns, or the column index in `basic_cdvs` for basic types.
+  cudf::size_type upper_bound_idx_or_basic_col_idx;
+};
+
 /**
  * @brief Computes the hash value of a row in the given table.
  *
@@ -205,11 +210,11 @@ class hive_device_row_hasher {
         auto const col_info        = _col_infos[flattened_index];
         auto const col_hash =
           (col_info.type_id == cudf::type_id::LIST || col_info.type_id == cudf::type_id::STRUCT)
-            ? hash_nested(flattened_index, row_index)
+            ? hash_nested(flattened_index, row_index, col_info)
             : cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
                 cudf::data_type{col_info.type_id},
                 _hash_functor,
-                _basic_cdvs[col_info.nested_num_children_or_basic_col_idx],
+                _basic_cdvs[col_info.upper_bound_idx_or_basic_col_idx],
                 row_index);
         return HIVE_HASH_FACTOR * hash + col_hash;
       });
@@ -221,19 +226,24 @@ class hive_device_row_hasher {
    */
   struct col_stack_frame {
    private:
-    cudf::size_type _col_idx;     // the column index in the flattened array
-    cudf::size_type _row_idx;     // the index of the row in the column
-    int _idx_to_process;          // the index of child or element to process next
-    hive_hash_value_t _cur_hash;  // current hash value of the column
+    col_info _col_info;               // the column info
+    cudf::size_type _col_idx;         // the column index in the flattened array
+    cudf::size_type _row_idx;         // the index of the row in the column
+    cudf::size_type _idx_to_process;  // the index of child or element to process next
+    hive_hash_value_t _cur_hash;      // current hash value of the column
 
    public:
     __device__ col_stack_frame() = default;
 
-    __device__ void init(cudf::size_type col_index, cudf::size_type row_idx)
+    __device__ void init(cudf::size_type col_index,
+                         cudf::size_type row_idx,
+                         cudf::size_type idx_begin,
+                         col_info info)
     {
       _col_idx        = col_index;
       _row_idx        = row_idx;
-      _idx_to_process = 0;
+      _idx_to_process = idx_begin;
+      _col_info       = info;
       _cur_hash       = HIVE_INIT_HASH;
     }
 
@@ -248,6 +258,8 @@ class hive_device_row_hasher {
 
     __device__ int get_idx_to_process() const { return _idx_to_process; }
 
+    __device__ col_info get_col_info() const { return _col_info; }
+
     __device__ cudf::size_type get_col_idx() const { return _col_idx; }
 
     __device__ cudf::size_type get_row_idx() const { return _row_idx; }
@@ -366,74 +378,96 @@ class hive_device_row_hasher {
    *
    * @param flattened_index The index of the column in the flattened array
    * @param row_index The index of the row to compute the hash for
+   * @param curr_col_info the column's information
    * @return The computed hive hash value
    */
   __device__ hive_hash_value_t hash_nested(cudf::size_type flattened_index,
-                                           cudf::size_type row_index) const noexcept
+                                           cudf::size_type row_index,
+                                           col_info curr_col_info) const noexcept
   {
     auto next_col_idx = flattened_index + 1;
 
     col_stack_frame col_stack[MAX_STACK_DEPTH];
     int stack_size = 0;
-    col_stack[stack_size++].init(flattened_index, row_index);
+
+    // If the current column is a lists column, we need to store the upper bound row offset.
+    // Otherwise, it is a structs column and already stores the number of children.
+    cudf::size_type curr_idx_begin = 0;
+    if (curr_col_info.type_id == cudf::type_id::LIST) {
+      auto const offsets = _basic_cdvs[_col_infos[next_col_idx].upper_bound_idx_or_basic_col_idx];
+      curr_idx_begin     = offsets.template element<cudf::size_type>(row_index);
+      curr_col_info.upper_bound_idx_or_basic_col_idx =
+        offsets.template element<cudf::size_type>(row_index + 1);
+    }
+    if (curr_col_info.upper_bound_idx_or_basic_col_idx == curr_idx_begin) { return HIVE_INIT_HASH; }
+
+    col_stack[stack_size++].init(flattened_index, row_index, curr_idx_begin, curr_col_info);
 
     while (stack_size > 0) {
       col_stack_frame& top     = col_stack[stack_size - 1];
       auto const curr_col_idx  = top.get_col_idx();
       auto const curr_row_idx  = top.get_row_idx();
-      auto const curr_col_info = _col_infos[curr_col_idx];
+      auto const curr_col_info = top.get_col_info();
       // Do not pop it until it is processed. The definition of `processed` is:
       // - For structs, it is when all child columns are processed.
       // - For lists, it is when all elements in the list are processed.
       if (curr_col_info.type_id == cudf::type_id::STRUCT) {
-        if (top.get_idx_to_process() == curr_col_info.nested_num_children_or_basic_col_idx) {
+        if (top.get_idx_to_process() == curr_col_info.upper_bound_idx_or_basic_col_idx) {
           if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
         } else {
           // Reset `next_col_idx` to keep track of the struct's children index.
           if (top.get_idx_to_process() == 0) { next_col_idx = curr_col_idx + 1; }
-          while (top.get_idx_to_process() < curr_col_info.nested_num_children_or_basic_col_idx) {
+
+          while (top.get_idx_to_process() < curr_col_info.upper_bound_idx_or_basic_col_idx) {
             top.get_and_inc_idx_to_process();
             auto const child_col_idx = next_col_idx++;
-            auto const child_info    = _col_infos[child_col_idx];
+            auto child_info          = _col_infos[child_col_idx];
             // If the child is of primitive type, accumulate child hash into struct hash
             if (child_info.type_id != cudf::type_id::LIST &&
                 child_info.type_id != cudf::type_id::STRUCT) {
-              auto const child_col = _basic_cdvs[child_info.nested_num_children_or_basic_col_idx];
+              auto const child_col = _basic_cdvs[child_info.upper_bound_idx_or_basic_col_idx];
               auto const child_hash =
                 cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
                   child_col.type(), _hash_functor, child_col, curr_row_idx);
               top.update_cur_hash(child_hash);
             } else {
-              col_stack[stack_size++].init(child_col_idx, curr_row_idx);
+              cudf::size_type child_idx_begin = 0;
+              if (child_info.type_id == cudf::type_id::LIST) {
+                auto const child_offsets_col_idx = child_col_idx + 1;
+                auto const child_offsets =
+                  _basic_cdvs[_col_infos[child_offsets_col_idx].upper_bound_idx_or_basic_col_idx];
+                child_idx_begin = child_offsets.template element<cudf::size_type>(curr_row_idx);
+                child_info.upper_bound_idx_or_basic_col_idx =
+                  child_offsets.template element<cudf::size_type>(curr_row_idx + 1);
+
+                // Ignore this child if it does not have any element.
+                if (child_info.upper_bound_idx_or_basic_col_idx == child_idx_begin) {
+                  next_col_idx += 2;
+                }
+              }
+              if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
+                col_stack[stack_size++].init(
+                  child_col_idx, curr_row_idx, child_idx_begin, child_info);
+              }
               break;
             }
           }
         }
       } else if (curr_col_info.type_id == cudf::type_id::LIST) {
-        // Get the child column of the list column
-        auto const offsets_col_idx = curr_col_idx + 1;
-        auto const child_col_idx   = curr_col_idx + 2;
+        auto const child_col_idx = curr_col_idx + 2;
+        auto child_info          = _col_infos[child_col_idx];
 
         // Move `next_col_idx` forward pass the current lists column.
         // Children of a lists column always stay next to it and are not tracked by this.
         if (next_col_idx <= child_col_idx) { next_col_idx = child_col_idx + 1; }
 
-        auto const offsets_col =
-          _basic_cdvs[_col_infos[offsets_col_idx].nested_num_children_or_basic_col_idx];
-
-        auto const child_col_info = _col_infos[child_col_idx];
-        auto const child_row_idx_begin =
-          offsets_col.template element<cudf::size_type>(curr_row_idx);
-        auto const child_row_idx_end =
-          offsets_col.template element<cudf::size_type>(curr_row_idx + 1);
-
         // If the child column is of primitive type, directly compute the hash value of the list
-        if (child_col_info.type_id != cudf::type_id::LIST &&
-            child_col_info.type_id != cudf::type_id::STRUCT) {
-          auto const child_col = _basic_cdvs[child_col_info.nested_num_children_or_basic_col_idx];
+        if (child_info.type_id != cudf::type_id::LIST &&
+            child_info.type_id != cudf::type_id::STRUCT) {
+          auto const child_col = _basic_cdvs[child_info.upper_bound_idx_or_basic_col_idx];
           auto const single_level_list_hash = cudf::detail::accumulate(
-            thrust::counting_iterator(child_row_idx_begin),
-            thrust::counting_iterator(child_row_idx_end),
+            thrust::make_counting_iterator(top.get_idx_to_process()),
+            thrust::make_counting_iterator(curr_col_info.upper_bound_idx_or_basic_col_idx),
             HIVE_INIT_HASH,
             [child_col, hasher = _hash_functor] __device__(auto hash, auto element_index) {
               auto cur_hash = cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
@@ -443,16 +477,35 @@ class hive_device_row_hasher {
           top.update_cur_hash(single_level_list_hash);
           if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
         } else {
-          if (top.get_idx_to_process() == child_row_idx_end - child_row_idx_begin) {
+          if (top.get_idx_to_process() == curr_col_info.upper_bound_idx_or_basic_col_idx) {
             if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
           } else {
             // Push the next element into the stack
-            col_stack[stack_size++].init(child_col_idx,
-                                         child_row_idx_begin + top.get_and_inc_idx_to_process());
+            cudf::size_type child_idx_begin = 0;
+            if (child_info.type_id == cudf::type_id::LIST) {
+              auto const child_offsets_col_idx = child_col_idx + 1;
+              auto const child_offsets =
+                _basic_cdvs[_col_infos[child_offsets_col_idx].upper_bound_idx_or_basic_col_idx];
+              child_idx_begin =
+                child_offsets.template element<cudf::size_type>(top.get_idx_to_process());
+              child_info.upper_bound_idx_or_basic_col_idx =
+                child_offsets.template element<cudf::size_type>(top.get_idx_to_process() + 1);
+
+              // Ignore this child if it does not have any element.
+              if (child_info.upper_bound_idx_or_basic_col_idx == child_idx_begin) {
+                next_col_idx += 2;
+              }
+            }
+            if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
+              col_stack[stack_size++].init(
+                child_col_idx, top.get_idx_to_process(), child_idx_begin, child_info);
+            }
+            top.get_and_inc_idx_to_process();
           }
         }
       }
     }
+
     return col_stack[0].get_hash();
   }
 
@@ -506,11 +559,13 @@ void flatten_table(std::vector<col_info>& col_infos,
   column_processer_fn_t flatten_column = [&](cudf::column_view const& col) {
     auto const type_id = col.type().id();
     if (type_id == cudf::type_id::LIST) {
-      col_infos.emplace_back(col_info{type_id, col.num_children()});
+      // Nested size will be updated separately for each row.
+      col_infos.emplace_back(col_info{type_id, -1});
       auto const list_col = cudf::lists_column_view(col);
       flatten_column(list_col.offsets());
       flatten_column(list_col.get_sliced_child(stream));
     } else if (type_id == cudf::type_id::STRUCT) {
+      // Nested size for struct columns is number of children.
       col_infos.emplace_back(col_info{type_id, col.num_children()});
       auto const struct_col = cudf::structs_column_view(col);
       for (auto child_idx = 0; child_idx < col.num_children(); child_idx++) {

This patch reduces global memory accesses as much as possible. Let's see how the benchmark looks.
I'm too tired to check it carefully, but all the tests pass.

child_info.upper_bound_idx_or_basic_col_idx =
child_offsets.template element<cudf::size_type>(top.get_idx_to_process() + 1);

// Ignore this child if it does not have any element.
@ustcfy ustcfy Jan 7, 2025

An empty list will also affect the hash value of its parent.
I added some tests for corner cases.

Collaborator Author

The code has become increasingly difficult to understand and maintain. 🤕

@ttnghia ttnghia Jan 7, 2025

An empty list will also affect the hash value of its parent.

From the docs, I see that:

   * hive_hash_value_t hive_hash(NestedType element) {
   *    hive_hash_value_t hash = HIVE_INIT_HASH;
   *    for (int i = 0; i < element.num_child(); i++) {
   *        hash = hash * HIVE_HASH_FACTOR + hive_hash(element.get_child(i));
   *    }
   *    return hash;
   * }

So when num_child (or the number of list elements) is 0, the for loop does not execute, and thus the returned value is HIVE_INIT_HASH. In my patch, this for loop is skipped entirely when there are no children/list elements, which yields the same output.

Collaborator Author

HIVE_INIT_HASH is 0. The hash value of structContainsNoChild itself is 0, but the hash values of struct(int1, structContainsNoChild, int2) and struct(int1, int2) are different. The current code is missing a call to top.update_cur_hash(structContainsNoChild_hash).

@ustcfy ustcfy Jan 7, 2025

Actually, I am not satisfied with storing cur_hash in the stack_frame, because it requires modifying the new stack top when popping.
I think cur_hash can be replaced with the exponent of 31 (or the precomputed value of $31^{exp}$), which is read-only. This way, accumulation can be performed only at basic types. Hive hash involves only addition and multiplication of INT32, so changing the order of operations should not affect the result.

Collaborator

The current code is missing a call to top.update_cur_hash(structContainsNoChild_hash).

This should fix that corner case:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index e152bea33c..d8a33b3d35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -448,6 +448,8 @@ class hive_device_row_hasher {
               if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
                 col_stack[stack_size++].init(
                   child_col_idx, curr_row_idx, child_idx_begin, child_info);
+              } else {
+                top.update_cur_hash(HIVE_INIT_HASH);
               }
               break;
             }
@@ -499,6 +501,8 @@ class hive_device_row_hasher {
             if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
               col_stack[stack_size++].init(
                 child_col_idx, top.get_idx_to_process(), child_idx_begin, child_info);
+            } else {
+              top.update_cur_hash(HIVE_INIT_HASH);
             }
             top.get_and_inc_idx_to_process();
           }

ttnghia
ttnghia previously approved these changes Jan 10, 2025
@ttnghia ttnghia left a comment

The new code reduces stack memory usage by 3X. This can reduce register pressure and potentially lead to higher occupancy and better performance.
