
Optimize Hive hash computation for nested types #2720

Open. Wants to merge 44 commits into base: branch-25.02

Conversation


@ustcfy ustcfy commented Dec 23, 2024

Main Optimization:

Flatten nested columns in advance, reducing the size of the stack_frame.

Possible Further Optimization:

  • Make the various hash functions share the nested-type flattening logic; however, xxhash64 does not need _cur_hash, which makes unification challenging.

Benchmark:

  1. Environment: NVIDIA TITAN RTX
  2. Sizes: 50MB, 100MB, 500MB, 1GB
  • schema: struct with a depth of max_depth, with basic types INT32, FLOAT32, and STRING
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   50000000   |      1      | 609.823 us |      56.37% | 452.706 us |       8.10% |   -157.117 us | -25.76% |   FAST   |
|  100000000   |      1      | 958.429 us |      32.03% | 606.708 us |      66.13% |   -351.722 us | -36.70% |   FAST   |
|  500000000   |      1      |   3.981 ms |      10.85% |   2.091 ms |       1.65% |  -1889.419 us | -47.46% |   FAST   |
|  1000000000  |      1      |   7.716 ms |       5.89% |   3.929 ms |       0.50% |  -3786.579 us | -49.08% |   FAST   |
|   50000000   |      2      |   1.493 ms |       1.52% | 555.161 us |       5.56% |   -937.916 us | -62.82% |   FAST   |
|  100000000   |      2      |   2.760 ms |       0.61% | 892.648 us |       3.01% |  -1867.525 us | -67.66% |   FAST   |
|  500000000   |      2      |  13.026 ms |       0.39% |   3.539 ms |      16.46% |  -9487.762 us | -72.83% |   FAST   |
|  1000000000  |      2      |  25.811 ms |       0.28% |   6.790 ms |       0.59% | -19021.039 us | -73.69% |   FAST   |
|   50000000   |      4      |   2.530 ms |       7.86% |   1.094 ms |       0.66% |  -1436.823 us | -56.78% |   FAST   |
|  100000000   |      4      |   4.850 ms |       5.51% |   1.946 ms |       1.25% |  -2903.567 us | -59.87% |   FAST   |
|  500000000   |      4      |  23.405 ms |       0.13% |   8.761 ms |       0.47% | -14643.401 us | -62.57% |   FAST   |
|  1000000000  |      4      |  46.649 ms |       0.20% |  17.254 ms |       2.64% | -29395.025 us | -63.01% |   FAST   |
|   50000000   |      8      |   4.468 ms |       0.68% |   2.102 ms |       0.53% |  -2366.352 us | -52.96% |   FAST   |
|  100000000   |      8      |   8.731 ms |       0.44% |   3.953 ms |       0.42% |  -4777.431 us | -54.72% |   FAST   |
|  500000000   |      8      |  42.672 ms |       0.08% |  18.808 ms |       0.39% | -23863.937 us | -55.92% |   FAST   |
|  1000000000  |      8      |  85.191 ms |       0.05% |  37.338 ms |       0.30% | -47853.264 us | -56.17% |   FAST   |
  • schema: list with a depth of max_depth, with the basic type being INT32
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   50000000   |      1      | 474.191 us |       5.38% | 467.645 us |      36.75% |      -6.546 us |  -1.38% |   SAME   |
|  100000000   |      1      | 682.398 us |      47.04% | 604.959 us |      65.52% |     -77.439 us | -11.35% |   SAME   |
|  500000000   |      1      |   2.523 ms |      10.46% |   2.630 ms |      11.76% |     106.458 us |   4.22% |   SAME   |
|  1000000000  |      1      |   4.800 ms |       0.20% |   3.370 ms |      10.34% |   -1429.947 us | -29.79% |   FAST   |
|   50000000   |      2      | 890.665 us |       1.14% | 867.720 us |       1.51% |     -22.946 us |  -2.58% |   FAST   |
|  100000000   |      2      |   1.858 ms |       0.55% |   1.832 ms |       1.02% |     -25.131 us |  -1.35% |   FAST   |
|  500000000   |      2      |   9.657 ms |       4.41% |   9.763 ms |       0.44% |     105.982 us |   1.10% |   SLOW   |
|  1000000000  |      2      |  19.158 ms |       2.73% |  19.139 ms |       0.26% |     -19.141 us |  -0.10% |   SAME   |
|   50000000   |      4      |   9.557 ms |       0.41% |   8.215 ms |       0.20% |   -1341.963 us | -14.04% |   FAST   |
|  100000000   |      4      |   9.596 ms |       0.20% |   8.235 ms |       0.21% |   -1360.763 us | -14.18% |   FAST   |
|  500000000   |      4      |  10.212 ms |       0.35% |   8.836 ms |       0.36% |   -1375.577 us | -13.47% |   FAST   |
|  1000000000  |      4      |  18.739 ms |       0.80% |  10.990 ms |       0.25% |   -7748.325 us | -41.35% |   FAST   |
|   50000000   |      8      |   3.643 us |      20.58% |   5.008 us |      12.68% |       1.365 us |  37.48% |   SLOW   |
|  100000000   |      8      |   3.612 us |      19.83% |   4.869 us |      13.84% |       1.257 us |  34.80% |   SLOW   |
|  500000000   |      8      |   19.112 s |        inf% |   18.441 s |        inf% | -671019.531 us |  -3.51% |   SAME   |
|  1000000000  |      8      |   19.252 s |        inf% |   18.581 s |        inf% | -670972.656 us |  -3.49% |   SAME   |
  • schema: list with a depth of max_depth, with the basic type being STRING
|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   50000000   |      1      |   1.225 ms |      34.18% |   1.238 ms |      34.57% |    12.566 us |   1.03% |   SAME   |
|  100000000   |      1      |   2.469 ms |       0.32% |   2.474 ms |       0.32% |     5.280 us |   0.21% |   SAME   |
|  500000000   |      1      |  11.670 ms |       0.13% |  11.631 ms |       0.48% |   -39.053 us |  -0.33% |   FAST   |
|  1000000000  |      1      |  23.225 ms |       0.32% |  23.167 ms |       0.24% |   -57.943 us |  -0.25% |   FAST   |
|   50000000   |      2      | 986.634 us |       0.99% | 959.146 us |       1.21% |   -27.489 us |  -2.79% |   FAST   |
|  100000000   |      2      |   4.743 ms |       0.87% |   4.753 ms |       0.84% |     9.306 us |   0.20% |   SAME   |
|  500000000   |      2      |  35.896 ms |       0.59% |  39.882 ms |       0.34% |     3.986 ms |  11.10% |   SLOW   |
|  1000000000  |      2      |  74.205 ms |       0.29% |  82.584 ms |       0.31% |     8.379 ms |  11.29% |   SLOW   |
|   50000000   |      4      |  33.784 ms |       0.11% |  32.820 ms |       0.10% |  -964.448 us |  -2.85% |   FAST   |
|  100000000   |      4      |  34.308 ms |       0.11% |  33.523 ms |       0.22% |  -784.787 us |  -2.29% |   FAST   |
|  500000000   |      4      |  34.769 ms |       0.22% |  34.006 ms |       0.12% |  -763.420 us |  -2.20% |   FAST   |
|  1000000000  |      4      |  35.852 ms |       0.16% |  34.705 ms |       0.11% | -1147.547 us |  -3.20% |   FAST   |
|   50000000   |      8      |   4.155 us |      18.74% |   4.042 us |      18.08% |    -0.113 us |  -2.72% |   SAME   |
|  100000000   |      8      |   3.960 us |      18.35% |   3.965 us |      19.00% |     0.005 us |   0.12% |   SAME   |
|  500000000   |      8      |   4.157 us |      24.39% |   4.064 us |      17.89% |    -0.092 us |  -2.22% |   SAME   |
|  1000000000  |      8      |   52.247 s |        inf% |   52.920 s |        inf% |   673.637 ms |   1.29% |   SAME   |

@ustcfy ustcfy changed the title Optimize Hive hash computation for nested types [Draft] Optimize Hive hash computation for nested types Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy requested a review from ttnghia December 23, 2024 08:42
@thirtiseven thirtiseven left a comment

Could you post the benchmark results here?

Maybe also describe how it failed on the string type, and how to reproduce it, if you need help with that issue.

@ustcfy ustcfy requested a review from res-life December 23, 2024 10:17
@ustcfy ustcfy self-assigned this Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
sperlingxx

This comment was marked as outdated.

@@ -486,15 +526,77 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,

check_nested_depth(input);

// `flattened_column_views` only contains nested columns and columns that result from flattening
// nested columns
std::vector<cudf::column_view> flattened_column_views;
Collaborator

The column_view constructor calculates the null count, which is time-consuming.
The original approach did not need to calculate the null count.
We may need to find a way to avoid this column_view array.
Please check whether contiguous_copy_column_device_views is helpful.

@res-life

Fixed the illegal memory access error reported for the struct(string) type:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index ca720b5..5e9bb35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -22,6 +22,7 @@
 #include <cudf/structs/structs_column_view.hpp>
 #include <cudf/table/experimental/row_operators.cuh>
 #include <cudf/table/table_device_view.cuh>
+#include <cudf/table/table_device_view.cuh>
 
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
@@ -566,17 +567,9 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     }
   }
 
-  std::vector<cudf::column_device_view> device_flattened_column_views;
-  device_flattened_column_views.reserve(flattened_column_views.size());
-
-  std::transform(
-    flattened_column_views.begin(),
-    flattened_column_views.end(),
-    std::back_inserter(device_flattened_column_views),
-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });
+  [[maybe_unused]] auto [device_view_owners, device_flattened_column_views] =
+    cudf::contiguous_copy_column_device_views<cudf::column_device_view>(flattened_column_views, stream);
 
-  auto flattened_column_device_views =
-    cudf::detail::make_device_uvector_async(device_flattened_column_views, stream, mr);
   auto first_child_index_view =
     cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
   auto nested_column_map_view =
@@ -594,7 +587,7 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     output_view.end<hive_hash_value_t>(),
     hive_device_row_hasher<hive_hash_function, bool>(nullable,
                                                      *input_view,
-                                                     flattened_column_device_views.data(),
+                                                     device_flattened_column_views,
                                                      first_child_index_view.data(),
                                                      nested_column_map_view.data()));

Please apply the above patch.

Signed-off-by: Yan Feng <[email protected]>
@ttnghia

ttnghia commented Dec 24, 2024

-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });

Remember never to dereference the output of column_device_view::create or table_device_view::create directly. They return an owning pointer; if we don't keep that pointer alive, it dangles right away and any reference obtained from it is invalidated.

@ustcfy ustcfy changed the title [Draft] Optimize Hive hash computation for nested types Optimize Hive hash computation for nested types Dec 25, 2024
@sperlingxx

sperlingxx commented Dec 25, 2024

I just came up with a random thought:
How about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result.
By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)

Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

@res-life

res-life commented Dec 25, 2024

I just came up with a random thought: how about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)

Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

Good idea for struct/primitive-only types.
But for the list type, e.g., a list(int) column:
row0: [1,2,3]
row1: [1,2]
row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also consider multiply nested lists, like list(list(int)) with data:
[[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3))
[[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea.
We can discuss this further.

@sperlingxx

I just came up with a random thought: how about flattening nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. By doing that, it seems possible to get rid of the stack operations and achieve the hash computation with a simple for-loop. (Because post-order traversal guarantees that all children are handled before their shared parents.)
Please correct me if I am wrong. Even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

Good idea for struct/primitive-only types. But for the list type, e.g., a list(int) column: row0: [1,2,3], row1: [1,2]. row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also consider multiply nested lists, like list(list(int)) with data: [[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3)); [[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea. We can discuss this further.

Yes, you are right! For the list type, it seems impossible to get rid of the stack operations.

@sperlingxx sperlingxx left a comment

LGTM in general, except for some NIT improvements.

ustcfy and others added 2 commits December 30, 2024 10:24
ustcfy and others added 2 commits January 4, 2025 00:44
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy marked this pull request as ready for review January 6, 2025 07:08
@res-life

res-life commented Jan 6, 2025

We do not need: class element_hasher_adapter

   private:
    hash_functor_t hash_functor;
    hive_device_row_hasher const& _parent;

Move hash_functor into hive_device_row_hasher, then _parent is not needed.

Signed-off-by: Yan Feng <[email protected]>
@ttnghia

ttnghia commented Jan 7, 2025

We do not need: class element_hasher_adapter

This is a good idea. It can save a register by eliminating the reference variable _parent. By doing so, we just move all the operator() overloads of the _hasher_adapter into the parent struct.

@ttnghia

ttnghia commented Jan 7, 2025

Here is another optimization:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index 7ce8026f22..49b04de86c 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -39,13 +39,6 @@ using hive_hash_value_t = int32_t;
 constexpr hive_hash_value_t HIVE_HASH_FACTOR = 31;
 constexpr hive_hash_value_t HIVE_INIT_HASH   = 0;
 
-struct col_info {
-  cudf::type_id type_id;
-  cudf::size_type
-    nested_num_children_or_basic_col_idx;  // Number of children for nested types, or column index
-                                           // in `basic_cdvs` for basic types
-};
-
 hive_hash_value_t __device__ inline compute_int(int32_t key) { return key; }
 
 hive_hash_value_t __device__ inline compute_long(int64_t key)
@@ -158,6 +151,18 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
   return static_cast<hive_hash_value_t>(result);
 }
 
+/**
+ * @brief The struct storing column's auxiliary information.
+ */
+struct col_info {
+  // Column type id.
+  cudf::type_id type_id;
+
+  // Store the upper bound of the number of elements for lists columns, or the upper bound of
+  // the number of children for structs columns, or the column index in `basic_cdvs` for basic types.
+  cudf::size_type upper_bound_idx_or_basic_col_idx;
+};
+
 /**
  * @brief Computes the hash value of a row in the given table.
  *
@@ -205,11 +210,11 @@ class hive_device_row_hasher {
         auto const col_info        = _col_infos[flattened_index];
         auto const col_hash =
           (col_info.type_id == cudf::type_id::LIST || col_info.type_id == cudf::type_id::STRUCT)
-            ? hash_nested(flattened_index, row_index)
+            ? hash_nested(flattened_index, row_index, col_info)
             : cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
                 cudf::data_type{col_info.type_id},
                 _hash_functor,
-                _basic_cdvs[col_info.nested_num_children_or_basic_col_idx],
+                _basic_cdvs[col_info.upper_bound_idx_or_basic_col_idx],
                 row_index);
         return HIVE_HASH_FACTOR * hash + col_hash;
       });
@@ -221,19 +226,24 @@ class hive_device_row_hasher {
    */
   struct col_stack_frame {
    private:
-    cudf::size_type _col_idx;     // the column index in the flattened array
-    cudf::size_type _row_idx;     // the index of the row in the column
-    int _idx_to_process;          // the index of child or element to process next
-    hive_hash_value_t _cur_hash;  // current hash value of the column
+    col_info _col_info;               // the column info
+    cudf::size_type _col_idx;         // the column index in the flattened array
+    cudf::size_type _row_idx;         // the index of the row in the column
+    cudf::size_type _idx_to_process;  // the index of child or element to process next
+    hive_hash_value_t _cur_hash;      // current hash value of the column
 
    public:
     __device__ col_stack_frame() = default;
 
-    __device__ void init(cudf::size_type col_index, cudf::size_type row_idx)
+    __device__ void init(cudf::size_type col_index,
+                         cudf::size_type row_idx,
+                         cudf::size_type idx_begin,
+                         col_info info)
     {
       _col_idx        = col_index;
       _row_idx        = row_idx;
-      _idx_to_process = 0;
+      _idx_to_process = idx_begin;
+      _col_info       = info;
       _cur_hash       = HIVE_INIT_HASH;
     }
 
@@ -248,6 +258,8 @@ class hive_device_row_hasher {
 
     __device__ int get_idx_to_process() const { return _idx_to_process; }
 
+    __device__ col_info get_col_info() const { return _col_info; }
+
     __device__ cudf::size_type get_col_idx() const { return _col_idx; }
 
     __device__ cudf::size_type get_row_idx() const { return _row_idx; }
@@ -366,74 +378,96 @@ class hive_device_row_hasher {
    *
    * @param flattened_index The index of the column in the flattened array
    * @param row_index The index of the row to compute the hash for
+   * @param curr_col_info the column's information
    * @return The computed hive hash value
    */
   __device__ hive_hash_value_t hash_nested(cudf::size_type flattened_index,
-                                           cudf::size_type row_index) const noexcept
+                                           cudf::size_type row_index,
+                                           col_info curr_col_info) const noexcept
   {
     auto next_col_idx = flattened_index + 1;
 
     col_stack_frame col_stack[MAX_STACK_DEPTH];
     int stack_size = 0;
-    col_stack[stack_size++].init(flattened_index, row_index);
+
+    // If the current column is a lists column, we need to store the upper bound row offset.
+    // Otherwise, it is a structs column and already stores the number of children.
+    cudf::size_type curr_idx_begin = 0;
+    if (curr_col_info.type_id == cudf::type_id::LIST) {
+      auto const offsets = _basic_cdvs[_col_infos[next_col_idx].upper_bound_idx_or_basic_col_idx];
+      curr_idx_begin     = offsets.template element<cudf::size_type>(row_index);
+      curr_col_info.upper_bound_idx_or_basic_col_idx =
+        offsets.template element<cudf::size_type>(row_index + 1);
+    }
+    if (curr_col_info.upper_bound_idx_or_basic_col_idx == curr_idx_begin) { return HIVE_INIT_HASH; }
+
+    col_stack[stack_size++].init(flattened_index, row_index, curr_idx_begin, curr_col_info);
 
     while (stack_size > 0) {
       col_stack_frame& top     = col_stack[stack_size - 1];
       auto const curr_col_idx  = top.get_col_idx();
       auto const curr_row_idx  = top.get_row_idx();
-      auto const curr_col_info = _col_infos[curr_col_idx];
+      auto const curr_col_info = top.get_col_info();
       // Do not pop it until it is processed. The definition of `processed` is:
       // - For structs, it is when all child columns are processed.
       // - For lists, it is when all elements in the list are processed.
       if (curr_col_info.type_id == cudf::type_id::STRUCT) {
-        if (top.get_idx_to_process() == curr_col_info.nested_num_children_or_basic_col_idx) {
+        if (top.get_idx_to_process() == curr_col_info.upper_bound_idx_or_basic_col_idx) {
           if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
         } else {
           // Reset `next_col_idx` to keep track of the struct's children index.
           if (top.get_idx_to_process() == 0) { next_col_idx = curr_col_idx + 1; }
-          while (top.get_idx_to_process() < curr_col_info.nested_num_children_or_basic_col_idx) {
+
+          while (top.get_idx_to_process() < curr_col_info.upper_bound_idx_or_basic_col_idx) {
             top.get_and_inc_idx_to_process();
             auto const child_col_idx = next_col_idx++;
-            auto const child_info    = _col_infos[child_col_idx];
+            auto child_info          = _col_infos[child_col_idx];
             // If the child is of primitive type, accumulate child hash into struct hash
             if (child_info.type_id != cudf::type_id::LIST &&
                 child_info.type_id != cudf::type_id::STRUCT) {
-              auto const child_col = _basic_cdvs[child_info.nested_num_children_or_basic_col_idx];
+              auto const child_col = _basic_cdvs[child_info.upper_bound_idx_or_basic_col_idx];
               auto const child_hash =
                 cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
                   child_col.type(), _hash_functor, child_col, curr_row_idx);
               top.update_cur_hash(child_hash);
             } else {
-              col_stack[stack_size++].init(child_col_idx, curr_row_idx);
+              cudf::size_type child_idx_begin = 0;
+              if (child_info.type_id == cudf::type_id::LIST) {
+                auto const child_offsets_col_idx = child_col_idx + 1;
+                auto const child_offsets =
+                  _basic_cdvs[_col_infos[child_offsets_col_idx].upper_bound_idx_or_basic_col_idx];
+                child_idx_begin = child_offsets.template element<cudf::size_type>(curr_row_idx);
+                child_info.upper_bound_idx_or_basic_col_idx =
+                  child_offsets.template element<cudf::size_type>(curr_row_idx + 1);
+
+                // Ignore this child if it does not have any element.
+                if (child_info.upper_bound_idx_or_basic_col_idx == child_idx_begin) {
+                  next_col_idx += 2;
+                }
+              }
+              if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
+                col_stack[stack_size++].init(
+                  child_col_idx, curr_row_idx, child_idx_begin, child_info);
+              }
               break;
             }
           }
         }
       } else if (curr_col_info.type_id == cudf::type_id::LIST) {
-        // Get the child column of the list column
-        auto const offsets_col_idx = curr_col_idx + 1;
-        auto const child_col_idx   = curr_col_idx + 2;
+        auto const child_col_idx = curr_col_idx + 2;
+        auto child_info          = _col_infos[child_col_idx];
 
         // Move `next_col_idx` forward pass the current lists column.
         // Children of a lists column always stay next to it and are not tracked by this.
         if (next_col_idx <= child_col_idx) { next_col_idx = child_col_idx + 1; }
 
-        auto const offsets_col =
-          _basic_cdvs[_col_infos[offsets_col_idx].nested_num_children_or_basic_col_idx];
-
-        auto const child_col_info = _col_infos[child_col_idx];
-        auto const child_row_idx_begin =
-          offsets_col.template element<cudf::size_type>(curr_row_idx);
-        auto const child_row_idx_end =
-          offsets_col.template element<cudf::size_type>(curr_row_idx + 1);
-
         // If the child column is of primitive type, directly compute the hash value of the list
-        if (child_col_info.type_id != cudf::type_id::LIST &&
-            child_col_info.type_id != cudf::type_id::STRUCT) {
-          auto const child_col = _basic_cdvs[child_col_info.nested_num_children_or_basic_col_idx];
+        if (child_info.type_id != cudf::type_id::LIST &&
+            child_info.type_id != cudf::type_id::STRUCT) {
+          auto const child_col = _basic_cdvs[child_info.upper_bound_idx_or_basic_col_idx];
           auto const single_level_list_hash = cudf::detail::accumulate(
-            thrust::counting_iterator(child_row_idx_begin),
-            thrust::counting_iterator(child_row_idx_end),
+            thrust::make_counting_iterator(top.get_idx_to_process()),
+            thrust::make_counting_iterator(curr_col_info.upper_bound_idx_or_basic_col_idx),
             HIVE_INIT_HASH,
             [child_col, hasher = _hash_functor] __device__(auto hash, auto element_index) {
               auto cur_hash = cudf::type_dispatcher<cudf::experimental::dispatch_void_if_nested>(
@@ -443,16 +477,35 @@ class hive_device_row_hasher {
           top.update_cur_hash(single_level_list_hash);
           if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
         } else {
-          if (top.get_idx_to_process() == child_row_idx_end - child_row_idx_begin) {
+          if (top.get_idx_to_process() == curr_col_info.upper_bound_idx_or_basic_col_idx) {
             if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
           } else {
             // Push the next element into the stack
-            col_stack[stack_size++].init(child_col_idx,
-                                         child_row_idx_begin + top.get_and_inc_idx_to_process());
+            cudf::size_type child_idx_begin = 0;
+            if (child_info.type_id == cudf::type_id::LIST) {
+              auto const child_offsets_col_idx = child_col_idx + 1;
+              auto const child_offsets =
+                _basic_cdvs[_col_infos[child_offsets_col_idx].upper_bound_idx_or_basic_col_idx];
+              child_idx_begin =
+                child_offsets.template element<cudf::size_type>(top.get_idx_to_process());
+              child_info.upper_bound_idx_or_basic_col_idx =
+                child_offsets.template element<cudf::size_type>(top.get_idx_to_process() + 1);
+
+              // Ignore this child if it does not have any element.
+              if (child_info.upper_bound_idx_or_basic_col_idx == child_idx_begin) {
+                next_col_idx += 2;
+              }
+            }
+            if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
+              col_stack[stack_size++].init(
+                child_col_idx, top.get_idx_to_process(), child_idx_begin, child_info);
+            }
+            top.get_and_inc_idx_to_process();
           }
         }
       }
     }
+
     return col_stack[0].get_hash();
   }
 
@@ -506,11 +559,13 @@ void flatten_table(std::vector<col_info>& col_infos,
   column_processer_fn_t flatten_column = [&](cudf::column_view const& col) {
     auto const type_id = col.type().id();
     if (type_id == cudf::type_id::LIST) {
-      col_infos.emplace_back(col_info{type_id, col.num_children()});
+      // Nested size will be updated separately for each row.
+      col_infos.emplace_back(col_info{type_id, -1});
       auto const list_col = cudf::lists_column_view(col);
       flatten_column(list_col.offsets());
       flatten_column(list_col.get_sliced_child(stream));
     } else if (type_id == cudf::type_id::STRUCT) {
+      // Nested size for struct columns is number of children.
       col_infos.emplace_back(col_info{type_id, col.num_children()});
       auto const struct_col = cudf::structs_column_view(col);
       for (auto child_idx = 0; child_idx < col.num_children(); child_idx++) {

This patch reduces global memory accesses as much as possible. Let's see how the benchmark looks.
I'm too tired to check it carefully, but all the tests pass.

child_info.upper_bound_idx_or_basic_col_idx =
child_offsets.template element<cudf::size_type>(top.get_idx_to_process() + 1);

// Ignore this child if it does not have any element.
@ustcfy ustcfy Jan 7, 2025

An empty list will also affect the hash value of its parent.
I added some tests for corner cases.

Collaborator Author

The code has become increasingly difficult to understand and maintain. 🤕

@ttnghia ttnghia Jan 7, 2025

An empty list will also affect the hash value of its parent.

From the docs, I see that:

   * hive_hash_value_t hive_hash(NestedType element) {
   *    hive_hash_value_t hash = HIVE_INIT_HASH;
   *    for (int i = 0; i < element.num_child(); i++) {
   *        hash = hash * HIVE_HASH_FACTOR + hive_hash(element.get_child(i));
   *    }
   *    return hash;
   * }

So when num_child (or the number of list elements) is 0, the for loop does not execute, and thus the returned value is HIVE_INIT_HASH. In my patch, this for loop is skipped entirely when there are no children/list elements, which yields the same output.

Collaborator Author

HIVE_INIT_HASH is 0. The hash value of structContainsNoChild itself is 0, but the hash values of struct(int1, structContainsNoChild, int2) and struct(int1, int2) are different. The current code is missing a call to top.update_cur_hash(structContainsNoChild_hash).

@ustcfy ustcfy Jan 7, 2025

Actually, I am not satisfied with storing cur_hash in the stack_frame, because it requires modifying the new stack top when popping.
I think cur_hash can be replaced with the exponent of 31 (or the precomputed value of $31^{exp}$), which is read-only. This way, accumulation can be performed only at basic types. Hive hash involves only addition and multiplication of INT32, so changing the order of operations should not affect the result.

Collaborator

The current code is missing a call to top.update_cur_hash(structContainsNoChild_hash).

This should fix that corner case:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index e152bea33c..d8a33b3d35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -448,6 +448,8 @@ class hive_device_row_hasher {
               if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
                 col_stack[stack_size++].init(
                   child_col_idx, curr_row_idx, child_idx_begin, child_info);
+              } else {
+                top.update_cur_hash(HIVE_INIT_HASH);
               }
               break;
             }
@@ -499,6 +501,8 @@ class hive_device_row_hasher {
             if (child_info.upper_bound_idx_or_basic_col_idx > child_idx_begin) {
               col_stack[stack_size++].init(
                 child_col_idx, top.get_idx_to_process(), child_idx_begin, child_info);
+            } else {
+              top.update_cur_hash(HIVE_INIT_HASH);
             }
             top.get_and_inc_idx_to_process();
           }

ttnghia
ttnghia previously approved these changes Jan 10, 2025
@ttnghia ttnghia left a comment

The new code reduces stack memory usage by 3X. This can reduce register pressure and potentially lead to higher occupancy and better performance.
