diff --git a/CHANGE_LOG.TXT b/CHANGE_LOG.TXT index 8c0b9a89ad..d0e0398204 100644 --- a/CHANGE_LOG.TXT +++ b/CHANGE_LOG.TXT @@ -1,25 +1,39 @@ //----------------------------------------------------------------------------- -1.3.0 03/03/2014 +1.3.1 05/22/2014 - New features: - - CUB's collective (block-wide, warp-wide) primitives underwent a minor - interface refactoring: - - To provide the appropriate support for multidimensional thread blocks, - the interfaces for collective classes are now template-parameterized - by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being - optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the - constructors that accept remapped linear thread-identifiers have been - removed: all primitives now assume a row-major thread-ranking for - multidimensional thread blocks. - - To allow the host program (compiled by the host-pass) to - accurately determine the device-specific storage requirements for - a given collective (compiled for each device-pass), the interfaces - for collective classes are now (optionally) template-parameterized - by the desired PTX compute capability. This is useful when - aliasing collective storage to shared memory that has been - allocated dynamically by the host at the kernel call site. - - Most CUB programs having typical 1D usage should not require any - changes to accommodate these updates. + - Added new "combination" WarpScan methods for efficiently computing + both inclusive and exclusive prefix scans (and sums). + - Bug fixes: + - Workaround for a benign WAW race warning reported by cuda-memcheck + in BlockScan specialized for the BLOCK_SCAN_WARP_SCANS algorithm. + - Fix for a bug in DeviceRadixSort where the algorithm could sort more + key bits than the caller specified (up to the nearest radix digit). + - Fix for a ~3% DeviceRadixSort performance regression on Kepler and + Fermi that was introduced in v1.3.0. + +//----------------------------------------------------------------------------- + +1.3.0 05/12/2014 + - New features: + - CUB's collective (block-wide, warp-wide) primitives underwent a minor + interface refactoring: + - To provide the appropriate support for multidimensional thread blocks, + the interfaces for collective classes are now template-parameterized + by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being + optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the + constructors that accept remapped linear thread-identifiers have been + removed: all primitives now assume a row-major thread-ranking for + multidimensional thread blocks. + - To allow the host program (compiled by the host-pass) to + accurately determine the device-specific storage requirements for + a given collective (compiled for each device-pass), the interfaces + for collective classes are now (optionally) template-parameterized + by the desired PTX compute capability. This is useful when + aliasing collective storage to shared memory that has been + allocated dynamically by the host at the kernel call site. + - Most CUB programs having typical 1D usage should not require any + changes to accommodate these updates.
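The "combination" WarpScan methods added in 1.3.1 deliver both scan flavors from a single pass over the warp. A minimal sketch of the intended call pattern, based on the WarpScan snippets added later in this diff (the kernel shape, output buffer, and 128-thread launch are illustrative assumptions, not part of the release notes):

    #include <cub/cub.cuh>

    // Assumed example kernel: one combination prefix sum per 32-thread warp
    __global__ void CombinationSumKernel(int *d_out)
    {
        // Specialize WarpScan for type int
        typedef cub::WarpScan<int> WarpScan;

        // Allocate WarpScan shared memory for 4 warps (128-thread block assumed)
        __shared__ typename WarpScan::TempStorage temp_storage[4];

        int warp_id = threadIdx.x / 32;
        int thread_data = 1;

        // One call produces both the inclusive and the exclusive prefix sum
        int inclusive_partial, exclusive_partial;
        WarpScan(temp_storage[warp_id]).Sum(thread_data, inclusive_partial, exclusive_partial);

        // With thread_data == 1: inclusive_partial is 1..32 within each warp,
        // exclusive_partial is 0..31, so the difference is always 1
        d_out[threadIdx.x] = inclusive_partial - exclusive_partial;
    }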
- Bug fixes: - Fixed bug in cub::WarpScan (which affected cub::BlockScan and cub::DeviceScan) where incorrect results (e.g., NAN) would often be @@ -34,7 +48,7 @@ //----------------------------------------------------------------------------- -1.2.3 03/03/2014 +1.2.3 04/01/2014 - Bug fixes: - Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types - Fixed code-snippet bug in ArgIndexInputIterator documentation @@ -43,18 +57,25 @@ 1.2.2 03/03/2014 - New features: - - Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode) - Added MS VC++ project solutions for device-wide and block-wide examples - Performance: - - Improved DeviceScan, DeviceSelect, DevicePartition performance - Added a third algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition) + - Bug fixes: + - Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly + +//----------------------------------------------------------------------------- + +1.2.0 02/25/2014 + - New features: + - Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode) + - Performance + - Improved DeviceScan, DeviceSelect, DevicePartition performance - Documentation and testing: - Compatible with CUDA 6.0 - - Added performance-portabiltiy plots for many device-wide primitives to doc + - Added performance-portability plots for many device-wide primitives to doc - Update doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older). - - Bug fixes: - - Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly + - Bug fixes - Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences) - Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status - Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block diff --git a/cub/block/block_radix_rank.cuh b/cub/block/block_radix_rank.cuh index dc808a06cf..4b5a6a7615 100644 --- a/cub/block/block_radix_rank.cuh +++ b/cub/block/block_radix_rank.cuh @@ -198,13 +198,17 @@ private: UnsignedBits (&keys)[KEYS_PER_THREAD], // Key to decode DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value (out parameter) DigitCounter* (&digit_counters)[KEYS_PER_THREAD], // Counter smem offset (out parameter) - int current_bit) // The least-significant bit position of the current digit to extract + int current_bit, // The least-significant bit position of the current digit to extract + int num_bits) // The number of bits in the current digit { + // Get digit + UnsignedBits digit = BFE(keys[COUNT], current_bit, num_bits); + // Get sub-counter - UnsignedBits sub_counter = BFE(keys[COUNT], current_bit + LOG_COUNTER_LANES, LOG_PACKING_RATIO); + UnsignedBits sub_counter = digit >> LOG_COUNTER_LANES; // Get counter lane - UnsignedBits counter_lane = BFE(keys[COUNT], current_bit, LOG_COUNTER_LANES); + UnsignedBits counter_lane = digit & (COUNTER_LANES - 1); if (DESCENDING) { @@ -222,7 +226,7 @@ private: *digit_counters[COUNT] = thread_prefixes[COUNT] + 1; // Iterate next key - Iterate::DecodeKeys(cta, 
keys, thread_prefixes, digit_counters, current_bit); + Iterate::DecodeKeys(cta, keys, thread_prefixes, digit_counters, current_bit, num_bits); } @@ -253,7 +257,9 @@ private: UnsignedBits (&keys)[KEYS_PER_THREAD], DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], DigitCounter* (&digit_counters)[KEYS_PER_THREAD], - int current_bit) {} + int current_bit, // The least-significant bit position of the current digit to extract + int num_bits) // The number of bits in the current digit + {} // UpdateRanks @@ -261,7 +267,8 @@ private: static __device__ __forceinline__ void UpdateRanks( int (&ranks)[KEYS_PER_THREAD], DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], - DigitCounter *(&digit_counters)[KEYS_PER_THREAD]) {} + DigitCounter *(&digit_counters)[KEYS_PER_THREAD]) + {} }; @@ -416,7 +423,8 @@ public: __device__ __forceinline__ void RankKeys( UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile - int current_bit) ///< [in] The least-significant bit position of the current digit to extract + int current_bit, ///< [in] The least-significant bit position of the current digit to extract + int num_bits) ///< [in] The number of bits in the current digit { DigitCounter thread_prefixes[KEYS_PER_THREAD]; // For each key, the count of previous keys in this tile having the same digit DigitCounter* digit_counters[KEYS_PER_THREAD]; // For each key, the byte-offset of its corresponding digit counter in smem @@ -425,7 +433,7 @@ public: ResetCounters(); // Decode keys and update digit counters - Iterate<0, KEYS_PER_THREAD>::DecodeKeys(*this, keys, thread_prefixes, digit_counters, current_bit); + Iterate<0, KEYS_PER_THREAD>::DecodeKeys(*this, keys, thread_prefixes, digit_counters, current_bit, num_bits); __syncthreads(); @@ -449,10 +457,11 @@ public: UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile (out parameter) int current_bit, ///< [in] The least-significant bit position of the current digit to extract + int num_bits, ///< [in] The number of bits in the current digit int &inclusive_digit_prefix) ///< [out] The inclusive prefix sum for the digit threadIdx.x { // Rank keys - RankKeys(keys, ranks, current_bit); + RankKeys(keys, ranks, current_bit, num_bits); // Get the inclusive and exclusive digit totals corresponding to the calling thread.
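/* Illustration with assumed values (not part of the diff): the refactored decode
 * extracts the digit exactly once, masked to num_bits, and then splits it into
 * packed shared-memory counter coordinates. With num_bits = 3, COUNTER_LANES = 4
 * (LOG_COUNTER_LANES = 2), and a key whose bits at current_bit are 0b101:
 *
 *     UnsignedBits digit        = BFE(key, current_bit, num_bits);   // 0b101 = 5
 *     UnsignedBits sub_counter  = digit >> LOG_COUNTER_LANES;        // 5 >> 2 = 1
 *     UnsignedBits counter_lane = digit & (COUNTER_LANES - 1);       // 5 &  3 = 1
 *
 * Because BFE zero-fills everything above num_bits, a final partial-width pass
 * can no longer rank key bits beyond the caller-specified range -- the
 * DeviceRadixSort over-sorting fix called out in the 1.3.1 changelog. */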
if ((BLOCK_THREADS == RADIX_DIGITS) || (linear_tid < RADIX_DIGITS)) diff --git a/cub/block/block_radix_sort.cuh b/cub/block/block_radix_sort.cuh index a415f3d973..36006b337c 100644 --- a/cub/block/block_radix_sort.cuh +++ b/cub/block/block_radix_sort.cuh @@ -220,12 +220,14 @@ private: UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD], int (&ranks)[ITEMS_PER_THREAD], int begin_bit, + int pass_bits, Int2Type is_descending) { AscendingBlockRadixRank(temp_storage.asending_ranking_storage).RankKeys( unsigned_keys, ranks, - begin_bit); + begin_bit, + pass_bits); } /// Rank keys (specialized for descending sort) @@ -299,9 +301,11 @@ private: // Radix sorting passes while (true) { + int pass_bits = CUB_MIN(RADIX_BITS, end_bit - begin_bit); + // Rank the blocked keys int ranks[ITEMS_PER_THREAD]; - RankKeys(unsigned_keys, ranks, begin_bit, is_descending); + RankKeys(unsigned_keys, ranks, begin_bit, pass_bits, is_descending); begin_bit += RADIX_BITS; __syncthreads(); @@ -349,9 +353,11 @@ private: // Radix sorting passes while (true) { + int pass_bits = CUB_MIN(RADIX_BITS, end_bit - begin_bit); + // Rank the blocked keys int ranks[ITEMS_PER_THREAD]; - RankKeys(unsigned_keys, ranks, begin_bit, is_descending); + RankKeys(unsigned_keys, ranks, begin_bit, pass_bits, is_descending); begin_bit += RADIX_BITS; __syncthreads(); diff --git a/cub/block_range/block_range_radix_sort_downsweep.cuh b/cub/block_range/block_range_radix_sort_downsweep.cuh index 50546a5b79..4141315ed9 100644 --- a/cub/block_range/block_range_radix_sort_downsweep.cuh +++ b/cub/block_range/block_range_radix_sort_downsweep.cuh @@ -242,6 +242,9 @@ struct BlockRangeRadixSortDownsweep // The least-significant bit position of the current digit to extract int current_bit; + // Number of bits in current digit + int num_bits; + // Whether to short-circuit bool short_circuit; @@ -261,7 +264,7 @@ struct BlockRangeRadixSortDownsweep #pragma unroll for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) { - UnsignedBits digit = BFE(twiddled_keys[KEY], current_bit, RADIX_BITS); + UnsignedBits digit = BFE(twiddled_keys[KEY], current_bit, num_bits); // Lookup base digit offset from shared memory relative_bin_offsets[KEY] = temp_storage.relative_bin_offsets[digit]; @@ -522,6 +525,7 @@ struct BlockRangeRadixSortDownsweep twiddled_keys, ranks, current_bit, + num_bits, inclusive_digit_prefix); // Update global scatter base offsets for each digit @@ -639,7 +643,8 @@ struct BlockRangeRadixSortDownsweep Key *d_keys_out, Value *d_values_in, Value *d_values_out, - int current_bit) + int current_bit, + int num_bits) : temp_storage(temp_storage.Alias()), bin_offset(bin_offset), d_keys_in(d_keys_in), d_keys_out(d_keys_out), d_values_in(d_values_in), d_values_out(d_values_out), current_bit(current_bit), + num_bits(num_bits), short_circuit(false) {} @@ -663,14 +669,16 @@ struct BlockRangeRadixSortDownsweep Key *d_keys_out, Value *d_values_in, Value *d_values_out, - int current_bit) + int current_bit, + int num_bits) : temp_storage(temp_storage.Alias()), d_keys_in(reinterpret_cast(d_keys_in)), d_keys_out(reinterpret_cast(d_keys_out)), d_values_in(d_values_in), d_values_out(d_values_out), - current_bit(current_bit) + current_bit(current_bit), + num_bits(num_bits) { // Load digit bin offsets (each of the first RADIX_DIGITS threads will load an offset for that digit) if (threadIdx.x < RADIX_DIGITS) diff --git a/cub/block_range/block_range_radix_sort_upsweep.cuh b/cub/block_range/block_range_radix_sort_upsweep.cuh index efb2f7bd30..faadbd3f47 100644 ---
a/cub/block_range/block_range_radix_sort_upsweep.cuh +++ b/cub/block_range/block_range_radix_sort_upsweep.cuh @@ -171,6 +171,9 @@ struct BlockRangeRadixSortUpsweep // The least-significant bit position of the current digit to extract int current_bit; + // Number of bits in current digit + int num_bits; + //--------------------------------------------------------------------- @@ -214,15 +217,17 @@ struct BlockRangeRadixSortUpsweep // Perform transform op UnsignedBits converted_key = Traits::TwiddleIn(key); - // Add in sub-counter offset - UnsignedBits sub_counter = BFE(converted_key, current_bit, LOG_PACKING_RATIO); + // Extract current digit bits + UnsignedBits digit = BFE(converted_key, current_bit, num_bits); + + // Get sub-counter offset + UnsignedBits sub_counter = digit & (PACKING_RATIO - 1); - // Add in row offset - UnsignedBits row_offset = BFE(converted_key, current_bit + LOG_PACKING_RATIO, LOG_COUNTER_LANES); + // Get row offset + UnsignedBits row_offset = digit >> LOG_PACKING_RATIO; // Increment counter temp_storage.digit_counters[row_offset][threadIdx.x][sub_counter]++; - } @@ -372,11 +377,13 @@ struct BlockRangeRadixSortUpsweep __device__ __forceinline__ BlockRangeRadixSortUpsweep( TempStorage &temp_storage, Key *d_keys_in, - int current_bit) + int current_bit, + int num_bits) : temp_storage(temp_storage.Alias()), d_keys_in(reinterpret_cast(d_keys_in)), - current_bit(current_bit) + current_bit(current_bit), + num_bits(num_bits) {} diff --git a/cub/device/dispatch/device_radix_sort_dispatch.cuh b/cub/device/dispatch/device_radix_sort_dispatch.cuh index bde3b9dbdb..028a5684ec 100644 --- a/cub/device/dispatch/device_radix_sort_dispatch.cuh +++ b/cub/device/dispatch/device_radix_sort_dispatch.cuh @@ -63,12 +63,13 @@ template < bool DESCENDING, ///< Whether or not the sorted-order is high-to-low typename Key, ///< Key type typename Offset> ///< Signed integer type for global offsets -__launch_bounds__ (int(BlockRangeRadixSortUpsweepPolicy::BLOCK_THREADS)) +__launch_bounds__ (int(BlockRangeRadixSortUpsweepPolicy::BLOCK_THREADS), 1) __global__ void RadixSortUpsweepKernel( Key *d_keys, ///< [in] Input keys buffer Offset *d_spine, ///< [out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) Offset num_items, ///< [in] Total number of input data items int current_bit, ///< [in] Bit position of current radix digit + int num_bits, ///< [in] Number of bits of current radix digit bool first_pass, ///< [in] Whether this is the first digit pass GridEvenShare even_share) ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block { @@ -82,7 +83,7 @@ __global__ void RadixSortUpsweepKernel( even_share.BlockInit(); Offset bin_count; - BlockRangeRadixSortUpsweepT(temp_storage, d_keys, current_bit).ProcessRegion( + BlockRangeRadixSortUpsweepT(temp_storage, d_keys, current_bit, num_bits).ProcessRegion( even_share.block_offset, even_share.block_end, bin_count); @@ -136,23 +137,24 @@ __global__ void RadixSortScanKernel( * Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place. 
*/ template < - typename BlockRangeRadixSortDownsweepPolicy, ///< Parameterizable tuning policy type for cub::BlockRangeRadixSortUpsweep abstraction - bool DESCENDING, ///< Whether or not the sorted-order is high-to-low - typename Key, ///< Key type - typename Value, ///< Value type - typename Offset> ///< Signed integer type for global offsets -__launch_bounds__ (int(BlockRangeRadixSortDownsweepPolicy::BLOCK_THREADS)) + typename BlockRangeRadixSortDownsweepPolicy, ///< Parameterizable tuning policy type for cub::BlockRangeRadixSortUpsweep abstraction + bool DESCENDING, ///< Whether or not the sorted-order is high-to-low + typename Key, ///< Key type + typename Value, ///< Value type + typename Offset> ///< Signed integer type for global offsets +__launch_bounds__ (int(BlockRangeRadixSortDownsweepPolicy::BLOCK_THREADS), 1) __global__ void RadixSortDownsweepKernel( - Key *d_keys_in, ///< [in] Input keys ping buffer - Key *d_keys_out, ///< [in] Output keys pong buffer - Value *d_values_in, ///< [in] Input values ping buffer - Value *d_values_out, ///< [in] Output values pong buffer - Offset *d_spine, ///< [in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) - Offset num_items, ///< [in] Total number of input data items - int current_bit, ///< [in] Bit position of current radix digit - bool first_pass, ///< [in] Whether this is the first digit pass - bool last_pass, ///< [in] Whether this is the last digit pass - GridEvenShare even_share) ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block + Key *d_keys_in, ///< [in] Input keys ping buffer + Key *d_keys_out, ///< [in] Output keys pong buffer + Value *d_values_in, ///< [in] Input values ping buffer + Value *d_values_out, ///< [in] Output values pong buffer + Offset *d_spine, ///< [in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.) 
+ Offset num_items, ///< [in] Total number of input data items + int current_bit, ///< [in] Bit position of current radix digit + int num_bits, ///< [in] Number of bits of current radix digit + bool first_pass, ///< [in] Whether this is the first digit pass + bool last_pass, ///< [in] Whether this is the last digit pass + GridEvenShare even_share) ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block { // Parameterize BlockRangeRadixSortDownsweep type for the current configuration typedef BlockRangeRadixSortDownsweep BlockRangeRadixSortDownsweepT; @@ -164,7 +166,7 @@ __global__ void RadixSortDownsweepKernel( even_share.BlockInit(); // Process input tiles - BlockRangeRadixSortDownsweepT(temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit).ProcessRegion( + BlockRangeRadixSortDownsweepT(temp_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit, num_bits).ProcessRegion( even_share.block_offset, even_share.block_end); } @@ -209,7 +211,7 @@ struct DeviceRadixSortDispatch typedef typename If::Type AltUpsweepPolicy; // ScanPolicy - typedef BlockRangeScanPolicy <1024, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy; + typedef BlockRangeScanPolicy <1024, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_WARP_SCANS> ScanPolicy; // Primary DownsweepPolicy typedef BlockRangeRadixSortDownsweepPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_DIRECT, LOAD_LDG, false, true, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeEightByte, RADIX_BITS> DownsweepPolicyKeys; @@ -280,13 +282,13 @@ struct DeviceRadixSortDispatch typedef BlockRangeScanPolicy <512, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy; // DownsweepPolicy - typedef BlockRangeRadixSortDownsweepPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_RAKING_MEMOIZE, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyKeys; - typedef BlockRangeRadixSortDownsweepPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_RAKING_MEMOIZE, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyPairs; + typedef BlockRangeRadixSortDownsweepPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyKeys; + typedef BlockRangeRadixSortDownsweepPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyPairs; typedef typename If::Type DownsweepPolicy; // Alternate DownsweepPolicy for (RADIX_BITS-1)-bit passes - typedef BlockRangeRadixSortDownsweepPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_RAKING_MEMOIZE, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS - 1> AltDownsweepPolicyKeys; - typedef BlockRangeRadixSortDownsweepPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_RAKING_MEMOIZE, RADIX_SORT_SCATTER_TWO_PHASE, 
cudaSharedMemBankSizeFourByte, RADIX_BITS - 1> AltDownsweepPolicyPairs; + typedef BlockRangeRadixSortDownsweepPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS - 1> AltDownsweepPolicyKeys; + typedef BlockRangeRadixSortDownsweepPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS - 1> AltDownsweepPolicyPairs; typedef typename If::Type AltDownsweepPolicy; }; @@ -650,6 +652,8 @@ struct DeviceRadixSortDispatch int current_bit = begin_bit; while (current_bit < end_bit) { + int num_bits = CUB_MIN(end_bit - current_bit, downsweep_config.radix_bits); + #if (CUB_PTX_ARCH == 0) // Update smem config if necessary if (current_smem_config != upsweep_config.smem_config) @@ -670,6 +674,7 @@ struct DeviceRadixSortDispatch d_spine, num_items, current_bit, + num_bits, (current_bit == begin_bit), even_share); @@ -716,6 +721,7 @@ struct DeviceRadixSortDispatch d_spine, num_items, current_bit, + num_bits, (current_bit == begin_bit), (current_bit + downsweep_config.radix_bits >= end_bit), even_share); diff --git a/cub/device/dispatch/device_reduce_by_key_dispatch.cuh b/cub/device/dispatch/device_reduce_by_key_dispatch.cuh index 0a56560ebf..81c028e19e 100644 --- a/cub/device/dispatch/device_reduce_by_key_dispatch.cuh +++ b/cub/device/dispatch/device_reduce_by_key_dispatch.cuh @@ -64,7 +64,7 @@ template < typename ValueInputIterator, ///< Random-access input iterator type for values typename ValueOutputIterator, ///< Random-access output iterator type for values typename NumSegmentsIterator, ///< Output iterator type for recording number of segments encountered - typename ScanTileState, ///< Tile status interface type + typename ScanTileState, ///< Tile status interface type typename EqualityOp, ///< Key equality operator type typename ReductionOp, ///< Value reduction operator type typename Offset> ///< Signed integer type for global offsets @@ -75,7 +75,7 @@ __global__ void ReduceByKeyRegionKernel( ValueInputIterator d_values_in, ///< [in] Pointer to consecutive runs of input values ValueOutputIterator d_values_out, ///< [in] Pointer to output value aggregates (one aggregate per run) NumSegmentsIterator d_num_segments, ///< [in] Pointer to total number of runs - ScanTileState tile_status, ///< [in] Tile status interface + ScanTileState tile_status, ///< [in] Tile status interface EqualityOp equality_op, ///< [in] Key equality operator ReductionOp reduction_op, ///< [in] Value reduction operator Offset num_items, ///< [in] Total number of items to select from diff --git a/cub/warp/warp_reduce.cuh b/cub/warp/warp_reduce.cuh index 3f25a80b41..1cd3fe0cff 100644 --- a/cub/warp/warp_reduce.cuh +++ b/cub/warp/warp_reduce.cuh @@ -215,7 +215,7 @@ public: /** - * \brief Computes a warp-wide sum in each active warp. The output is valid in warp lane0. + * \brief Computes a warp-wide sum in the calling warp. The output is valid in warp lane0. * * \smemreuse * @@ -255,9 +255,9 @@ public: } /** - * \brief Computes a partially-full warp-wide sum in each active warp. The output is valid in warp lane0. + * \brief Computes a partially-full warp-wide sum in the calling warp. The output is valid in warp lane0. * - * All threads in each logical warp must agree on the same value for \p valid_items. Otherwise the result is undefined. 
+ * All threads across the calling warp must agree on the same value for \p valid_items. Otherwise the result is undefined. * * \smemreuse * @@ -309,7 +309,7 @@ public: /** - * \brief Computes a segmented sum in each active warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). + * \brief Computes a segmented sum in the calling warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). * * \smemreuse * @@ -357,7 +357,7 @@ public: /** - * \brief Computes a segmented sum in each active warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). + * \brief Computes a segmented sum in the calling warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0). * * \smemreuse * @@ -411,7 +411,7 @@ public: //@{ /** - * \brief Computes a warp-wide reduction in each active warp using the specified binary reduction functor. The output is valid in warp lane0. + * \brief Computes a warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0. * * Supports non-commutative reduction operators * @@ -457,9 +457,9 @@ public: } /** - * \brief Computes a partially-full warp-wide reduction in each active warp using the specified binary reduction functor. The output is valid in warp lane0. + * \brief Computes a partially-full warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0. * - * All threads in each logical warp must agree on the same value for \p valid_items. Otherwise the result is undefined. + * All threads across the calling warp must agree on the same value for \p valid_items. Otherwise the result is undefined. * * Supports non-commutative reduction operators * @@ -516,7 +516,7 @@ public: /** - * \brief Computes a segmented reduction in each active warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). + * \brief Computes a segmented reduction in the calling warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). * * Supports non-commutative reduction operators * @@ -567,7 +567,7 @@ public: /** - * \brief Computes a segmented reduction in each active warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). + * \brief Computes a segmented reduction in the calling warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0). * * Supports non-commutative reduction operators * diff --git a/cub/warp/warp_scan.cuh b/cub/warp/warp_scan.cuh index b7e2c93c0a..ba604ec311 100644 --- a/cub/warp/warp_scan.cuh +++ b/cub/warp/warp_scan.cuh @@ -216,7 +216,7 @@ public: /** - * \brief Computes an inclusive prefix sum in each logical warp. + * \brief Computes an inclusive prefix sum across the calling warp. * * \smemreuse * @@ -257,7 +257,7 @@ public: /** - * \brief Computes an inclusive prefix sum in each logical warp. 
Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * The \p warp_aggregate is undefined in threads other than warp-lane0. * @@ -302,7 +302,7 @@ public: /** - * \brief Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an inclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * The \p warp_aggregate is undefined in threads other than warp-lane0. * @@ -415,7 +415,7 @@ private: InternalWarpScan(temp_storage).Scan(input, inclusive_output, exclusive_output, identity, cub::Sum()); } - /// Computes an exclusive prefix sum in each logical warp. + /// Computes an exclusive prefix sum across the calling warp. __device__ __forceinline__ void ExclusiveSum(T input, T &output, Int2Type is_integer) { // Compute exclusive warp scan from inclusive warp scan @@ -424,7 +424,7 @@ private: output = inclusive - input; } - /// Computes an exclusive prefix sum in each logical warp. Specialized for non-integer types. + /// Computes an exclusive prefix sum across the calling warp. Specialized for non-integer types. __device__ __forceinline__ void ExclusiveSum(T input, T &output, Int2Type is_integer) { // Delegate to regular scan for non-integer types (because we won't be able to use subtraction) @@ -432,7 +432,7 @@ private: ExclusiveScan(input, output, identity, cub::Sum()); } - /// Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + /// Computes an exclusive prefix sum across the calling warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, Int2Type is_integer) { // Compute exclusive warp scan from inclusive warp scan @@ -441,7 +441,7 @@ private: output = inclusive - input; } - /// Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-integer types. + /// Computes an exclusive prefix sum across the calling warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-integer types. __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, Int2Type is_integer) { // Delegate to regular scan for non-integer types (because we won't be able to use subtraction) @@ -449,7 +449,7 @@ private: ExclusiveScan(input, output, identity, cub::Sum(), warp_aggregate); } - /// Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + /// Computes an exclusive prefix sum across the calling warp. 
Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. template __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op, Int2Type is_integer) { @@ -459,7 +459,7 @@ private: output = inclusive - input; } - /// Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-integer types. + /// Computes an exclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-integer types. template __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, WarpPrefixCallbackOp &warp_prefix_op, Int2Type is_integer) { @@ -478,7 +478,7 @@ public: /** - * \brief Computes an exclusive prefix sum in each logical warp. + * \brief Computes an exclusive prefix sum across the calling warp. * * This operation assumes the value of obtained by the T's default * constructor (or by zero-initialization if no user-defined default @@ -525,7 +525,7 @@ public: /** - * \brief Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an exclusive prefix sum across the calling warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * This operation assumes the value of obtained by the T's default * constructor (or by zero-initialization if no user-defined default @@ -573,7 +573,7 @@ public: /** - * \brief Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an exclusive prefix sum across the calling warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * This operation assumes the value of obtained by the T's default * constructor (or by zero-initialization if no user-defined default @@ -668,7 +668,7 @@ public: //@{ /** - * \brief Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. + * \brief Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. * * Supports non-commutative scan operators. * @@ -715,7 +715,7 @@ public: /** - * \brief Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. 
Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * Supports non-commutative scan operators. * @@ -767,7 +767,7 @@ public: /** - * \brief Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * The \p warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation. @@ -873,7 +873,7 @@ public: //@{ /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. * * Supports non-commutative scan operators. * @@ -922,7 +922,7 @@ public: /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * Supports non-commutative scan operators. * @@ -974,7 +974,7 @@ public: /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * The \p warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation. @@ -1083,7 +1083,7 @@ public: /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the \p output computed for warp-lane0 is undefined. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the \p output computed for warp-lane0 is undefined. * * Supports non-commutative scan operators. * @@ -1116,7 +1116,7 @@ public: * Suppose the set of input \p thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. * The corresponding output \p thread_data in the first warp would be * ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. - * (The output \p thread_data in each warp lane0 is undefined.) + * (The output \p thread_data in warp lane0 is undefined.) 
* * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) */ @@ -1132,7 +1132,7 @@ public: /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the \p output computed for warp-lane0 is undefined. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no identity value is supplied, the \p output computed for warp-lane0 is undefined. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. * * Supports non-commutative scan operators. * @@ -1166,7 +1166,7 @@ public: * Suppose the set of input \p thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. * The corresponding output \p thread_data in the first warp would be * ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. - * (The output \p thread_data in each warp lane0 is undefined.) Furthermore, \p warp_aggregate would be assigned \p 30 for threads in the first warp, \p 62 for threads + * (The output \p thread_data in warp lane0 is undefined.) Furthermore, \p warp_aggregate would be assigned \p 30 for threads in the first warp, \p 62 for threads * in the second warp, etc. * * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) @@ -1183,7 +1183,7 @@ public: /** - * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The \p warp_prefix_op value from thread-thread-lane0 is applied to all scan outputs. Also computes the warp-wide \p warp_aggregate of all inputs for thread-thread-lane0. + * \brief Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. The \p warp_prefix_op value from warp-lane0 is applied to all scan outputs. Also computes the warp-wide \p warp_aggregate of all inputs for warp-lane0. * * The \p warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation. @@ -1288,7 +1288,48 @@ *********************************************************************/ //@{ - /// Combination scan with identity + /** + * \brief Computes both inclusive and exclusive prefix sums across the calling warp. + * + * This operation assumes the value obtained by the T's default + * constructor (or by zero-initialization if no user-defined default + * constructor exists) is suitable as the identity value "zero" for + * addition. + * + * \smemreuse + * + * \par Snippet + * The code snippet below illustrates four concurrent warp-wide prefix sums within a block of + * 128 threads (one per each of the 32-thread warps). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize WarpScan for type int + * typedef cub::WarpScan WarpScan; + * + * // Allocate WarpScan shared memory for 4 warps + * __shared__ typename WarpScan::TempStorage temp_storage[4]; + * + * // Obtain one input item per thread + * int thread_data = ... + * + * // Compute in|exclusive warp-wide prefix sums + * int inclusive_partial, exclusive_partial; + * int warp_id = threadIdx.x / 32; + * WarpScan(temp_storage[warp_id]).Sum(thread_data, inclusive_partial, exclusive_partial); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is {1, 1, 1, 1, ...}. + * The corresponding output \p inclusive_partial in each of the four warps of threads will be + * {1, 2, 3, ..., 32}. + * The corresponding output \p exclusive_partial in each of the four warps of threads will be + * {0, 1, 2, ..., 31}. + * + */ __device__ __forceinline__ void Sum( T input, ///< [in] Calling thread's input item. T &inclusive_output, ///< [out] Calling thread's inclusive-scan output item. @@ -1297,7 +1338,47 @@ Sum(input, inclusive_output, exclusive_output, Int2Type()); } - /// Combination scan with identity + + /** + * \brief Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. + * + * Supports non-commutative scan operators. + * + * \smemreuse + * + * \par Snippet + * The code snippet below illustrates four concurrent warp-wide prefix max scans within a block of + * 128 threads (one per each of the 32-thread warps). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize WarpScan for type int + * typedef cub::WarpScan WarpScan; + * + * // Allocate WarpScan shared memory for 4 warps + * __shared__ typename WarpScan::TempStorage temp_storage[4]; + * + * // Obtain one input item per thread + * int thread_data = ... + * + * // Compute inclusive warp-wide prefix max scans + * int warp_id = threadIdx.x / 32; + * int inclusive_partial, exclusive_partial; + * WarpScan(temp_storage[warp_id]).Scan(thread_data, inclusive_partial, exclusive_partial, INT_MIN, cub::Max()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. + * The corresponding output \p inclusive_partial in the first warp would be + * 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. + * The corresponding output \p exclusive_partial in the first warp would be + * INT_MIN, 0, 0, 2, ..., 28, 30, the output for the second warp would be INT_MIN, 32, 32, 34, ..., 60, 62, etc. + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ template __device__ __forceinline__ void Scan( T input, ///< [in] Calling thread's input item. T &inclusive_output, ///< [out] Calling thread's inclusive-scan output item. T &exclusive_output, ///< [out] Calling thread's exclusive-scan output item. @@ -1309,7 +1390,47 @@ InternalWarpScan(temp_storage).Scan(input, inclusive_output, exclusive_output, identity, scan_op); } - /// Combination scan with without identity + + /** + * \brief Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no identity value is supplied, the \p exclusive_output computed for warp-lane0 is undefined. + * + * Supports non-commutative scan operators. + * + * \smemreuse + * + * \par Snippet + * The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of + * 128 threads (one per each of the 32-thread warps). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize WarpScan for type int + * typedef cub::WarpScan WarpScan; + * + * // Allocate WarpScan shared memory for 4 warps + * __shared__ typename WarpScan::TempStorage temp_storage[4]; + * + * // Obtain one input item per thread + * int thread_data = ... + * + * // Compute exclusive warp-wide prefix max scans + * int warp_id = threadIdx.x / 32; + * int inclusive_partial, exclusive_partial; + * WarpScan(temp_storage[warp_id]).Scan(thread_data, inclusive_partial, exclusive_partial, cub::Max()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. + * The corresponding output \p inclusive_partial in the first warp would be + * 0, 0, 2, 2, ..., 30, 30, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62, etc. + * The corresponding output \p exclusive_partial in the first warp would be + * ?, 0, 0, 2, ..., 28, 30, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62, etc. + * (The output \p exclusive_partial in warp lane0 is undefined.) + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ template __device__ __forceinline__ void Scan( T input, ///< [in] Calling thread's input item. diff --git a/docs/download_cub.html b/docs/download_cub.html index bce34c6d19..68ae20f4aa 100644 --- a/docs/download_cub.html +++ b/docs/download_cub.html @@ -37,14 +37,14 @@
If your download doesn't start in 3s:

[download_cub.html hunk: the old (-) and new (+) "Download CUB!" anchor lines were updated; the anchor markup and package URLs were stripped in extraction, leaving only the unchanged visible text "Download CUB!"]
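Taken together, the radix-sort changes above all feed one dispatch-level computation: each pass now receives a digit width clamped to the caller's bit range, per the line added in device_radix_sort_dispatch.cuh. A small standalone host-side illustration of that loop, with assumed values (CUB's real dispatch additionally selects tuned kernel configurations and ping-pongs the key/value buffers):

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        int begin_bit = 0, end_bit = 19;  // caller requests 19 key bits
        int radix_bits = 5;               // digit width from the tuned policy

        for (int current_bit = begin_bit; current_bit < end_bit; current_bit += radix_bits)
        {
            // Clamp the final pass so it never ranks bits past end_bit
            int num_bits = std::min(end_bit - current_bit, radix_bits);
            std::printf("pass at bit %2d: %d-bit digit\n", current_bit, num_bits);
        }
        // Prints passes of 5, 5, 5, and a final 4-bit pass. Before this fix, the
        // last pass used the full 5-bit digit and sorted one more key bit than
        // requested -- "up to the nearest radix digit", as the changelog puts it.
        return 0;
    }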
diff --git a/docs/html/CHANGE_LOG.TXT b/docs/html/CHANGE_LOG.TXT index 8c0b9a89ad..d0e0398204 100644 --- a/docs/html/CHANGE_LOG.TXT +++ b/docs/html/CHANGE_LOG.TXT @@ -1,25 +1,39 @@ //----------------------------------------------------------------------------- -1.3.0 03/03/2014 +1.3.1 05/22/2014 - New features: - - CUB's collective (block-wide, warp-wide) primitives underwent a minor - interface refactoring: - - To provide the appropriate support for multidimensional thread blocks, - the interfaces for collective classes are now template-parameterized - by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being - optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the - constructors that accept remapped linear thread-identifiers have been - removed: all primitives now assume a row-major thread-ranking for - multidimensional thread blocks. - - To allow the host program (compiled by the host-pass) to - accurately determine the device-specific storage requirements for - a given collective (compiled for each device-pass), the interfaces - for collective classes are now (optionally) template-parameterized - by the desired PTX compute capability. This is useful when - aliasing collective storage to shared memory that has been - allocated dynamically by the host at the kernel call site. - - Most CUB programs having typical 1D usage should not require any - changes to accommodate these updates. + - Added new "combination" WarpScan methods for efficiently computing + both inclusive and exclusive prefix scans (and sums). + - Bug fixes: + - Workaround for a benign WAW race warning reported by cuda-memcheck + in BlockScan specialized for the BLOCK_SCAN_WARP_SCANS algorithm. + - Fix for a bug in DeviceRadixSort where the algorithm could sort more + key bits than the caller specified (up to the nearest radix digit). + - Fix for a ~3% DeviceRadixSort performance regression on Kepler and + Fermi that was introduced in v1.3.0. + +//----------------------------------------------------------------------------- + +1.3.0 05/12/2014 + - New features: + - CUB's collective (block-wide, warp-wide) primitives underwent a minor + interface refactoring: + - To provide the appropriate support for multidimensional thread blocks, + the interfaces for collective classes are now template-parameterized + by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being + optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the + constructors that accept remapped linear thread-identifiers have been + removed: all primitives now assume a row-major thread-ranking for + multidimensional thread blocks. + - To allow the host program (compiled by the host-pass) to + accurately determine the device-specific storage requirements for + a given collective (compiled for each device-pass), the interfaces + for collective classes are now (optionally) template-parameterized + by the desired PTX compute capability. This is useful when + aliasing collective storage to shared memory that has been + allocated dynamically by the host at the kernel call site. + - Most CUB programs having typical 1D usage should not require any + changes to accommodate these updates.
- Bug fixes: - Fixed bug in cub::WarpScan (which affected cub::BlockScan and cub::DeviceScan) where incorrect results (e.g., NAN) would often be @@ -34,7 +48,7 @@ //----------------------------------------------------------------------------- -1.2.3 03/03/2014 +1.2.3 04/01/2014 - Bug fixes: - Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types - Fixed code-snippet bug in ArgIndexInputIterator documentation @@ -43,18 +57,25 @@ 1.2.2 03/03/2014 - New features: - - Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode) - Added MS VC++ project solutions for device-wide and block-wide examples - Performance: - - Improved DeviceScan, DeviceSelect, DevicePartition performance - Added a third algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition) + - Bug fixes: + - Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly + +//----------------------------------------------------------------------------- + +1.2.0 02/25/2014 + - New features: + - Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode) + - Performance + - Improved DeviceScan, DeviceSelect, DevicePartition performance - Documentation and testing: - Compatible with CUDA 6.0 - - Added performance-portabiltiy plots for many device-wide primitives to doc + - Added performance-portability plots for many device-wide primitives to doc - Update doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older). - - Bug fixes: - - Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly + - Bug fixes - Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences) - Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status - Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block diff --git a/docs/html/annotated.html b/docs/html/annotated.html index bcbb2df3e2..93df51ac7c 100644 --- a/docs/html/annotated.html +++ b/docs/html/annotated.html @@ -228,7 +228,7 @@