CHANGE_LOG.TXT
//-----------------------------------------------------------------------------
1.4.1 04/13/2015
- Bug fixes:
- Fixes for CUDA 7.0 issues with SHFL-based warp-scan and warp-reduction
on non-primitive data types (e.g., user-defined structs)
- Fixes for minor CUDA 7.0 performance regressions in cub::DeviceScan,
DeviceReduceByKey
- Fixes to allow cub::DeviceRadixSort and cub::BlockRadixSort on bool types
- Remove requirement for callers to define the CUB_CDP macro
when invoking CUB device-wide routines using CUDA dynamic parallelism
- Fix for headers not being included in the proper order (or missing includes)
for some block-wide functions
//-----------------------------------------------------------------------------
1.4.0 03/18/2015
- New Features:
- Support and performance tuning for new Maxwell GPU architectures
- Updated cub::DeviceHistogram implementation that provides the same
"histogram-even" and "histogram-range" functionality as IPP/NPP.
Provides extremely fast and, perhaps more importantly, very
uniform performance response across diverse real-world datasets,
including pathological (homogeneous) sample distributions (resilience)
- New cub::DeviceSpmv methods for multiplying sparse matrices by
dense vectors, load-balanced using a merge-based parallel decomposition.
- New cub::DeviceRadixSort sorting entry-points that always return
the sorted output into the specified buffer (as opposed to the
cub::DoubleBuffer in which it could end up in either buffer)
- New cub::DeviceRunLengthEncode::NonTrivialRuns for finding the starting
offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
a given sequence. (Useful for top-down partitioning algorithms like
MSD sorting of very-large keys.)
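
The "non-trivial runs" described above can be pinned down with a sequential host-side reference. This is only a sketch of the semantics stated in the entry (starting offsets and lengths of runs longer than one key); the names Run and NonTrivialRunsReference are hypothetical and are not part of CUB, and the device-wide method's actual signature is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One run of equal, consecutive keys (hypothetical helper type).
struct Run {
    std::size_t offset;  // index of the first key in the run
    std::size_t length;  // number of consecutive equal keys
};

// Sequential reference: report every run whose length is greater than 1.
template <typename Key>
std::vector<Run> NonTrivialRunsReference(const std::vector<Key>& keys) {
    std::vector<Run> runs;
    std::size_t start = 0;
    for (std::size_t i = 1; i <= keys.size(); ++i) {
        // A run ends at the end of the input or where the key changes.
        if (i == keys.size() || !(keys[i] == keys[start])) {
            if (i - start > 1) runs.push_back({start, i - start});
            start = i;
        }
    }
    return runs;
}
```

For the keys {1, 1, 2, 3, 3, 3, 4} this reference reports two runs: offset 0 with length 2, and offset 3 with length 3. The singleton keys 2 and 4 are skipped.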
//-----------------------------------------------------------------------------
1.3.2 07/28/2014
- Bug fixes:
- Fix for cub::DeviceReduce where reductions of small problems
(small enough to only dispatch a single threadblock) would run in
the default stream (stream zero) regardless of whether an alternate
stream was specified.
//-----------------------------------------------------------------------------
1.3.1 05/23/2014
- Bug fixes:
- Workaround for a benign WAW race warning reported by cuda-memcheck
in BlockScan specialized for BLOCK_SCAN_WARP_SCANS algorithm.
- Fix for bug in DeviceRadixSort where the algorithm may sort more
key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% DeviceRadixSort performance regression on Kepler and
Fermi that was introduced in v1.3.0.
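
The DeviceRadixSort bit-range issue above comes down to simple arithmetic: if every pass consumes a full radix digit, the bits covered round up to a multiple of the digit width. The sketch below illustrates that arithmetic and one way a final pass could be clamped. It is an assumption-laden illustration, not CUB's actual fix; PassWidths, BitsCoveredUnclamped, and the radix_bits parameter are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Split the bit range [begin_bit, end_bit) into digit passes of at most
// radix_bits each. The final pass is clamped so that no bits at or above
// end_bit are sorted. radix_bits is illustrative, not a documented setting.
std::vector<int> PassWidths(int begin_bit, int end_bit, int radix_bits) {
    assert(radix_bits > 0 && begin_bit <= end_bit);
    std::vector<int> widths;
    for (int bit = begin_bit; bit < end_bit; bit += radix_bits) {
        int remaining = end_bit - bit;
        widths.push_back(remaining < radix_bits ? remaining : radix_bits);
    }
    return widths;
}

// Bits covered when every pass uses a full digit (the over-sorting case).
int BitsCoveredUnclamped(int begin_bit, int end_bit, int radix_bits) {
    assert(radix_bits > 0 && begin_bit <= end_bit);
    int num_bits = end_bit - begin_bit;
    int passes = (num_bits + radix_bits - 1) / radix_bits;
    return passes * radix_bits;
}
```

With a 12-bit range and a hypothetical 5-bit digit, unclamped passes would cover 15 bits, while the clamped pass widths are 5, 5, and 2.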
//-----------------------------------------------------------------------------
1.3.0 05/12/2014
- New features:
- CUB's collective (block-wide, warp-wide) primitives underwent a minor
interface refactoring:
- To provide the appropriate support for multidimensional thread blocks,
the interfaces for collective classes are now template-parameterized
by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being
optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the
constructors that accept remapped linear thread-identifiers have been
removed: all primitives now assume a row-major thread-ranking for
multidimensional thread blocks.
- To allow the host program (compiled by the host-pass) to
accurately determine the device-specific storage requirements for
a given collective (compiled for each device-pass), the interfaces
for collective classes are now (optionally) template-parameterized
by the desired PTX compute capability. This is useful when
aliasing collective storage to shared memory that has been
allocated dynamically by the host at the kernel call site.
- Most CUB programs having typical 1D usage should not require any
changes to accommodate these updates.
- Added new "combination" WarpScan methods for efficiently computing
both inclusive and exclusive prefix scans (and sums).
- Bug fixes:
- Fixed bug in cub::WarpScan (which affected cub::BlockScan and
cub::DeviceScan) where incorrect results (e.g., NAN) would often be
returned when parameterized for floating-point types (fp32, fp64).
- Workaround-fix for ptxas error when compiling with the -G flag on Linux
(for debug instrumentation)
- Misc. workaround-fixes for certain scan scenarios (using custom
scan operators) where code compiled for SM1x is run on newer
GPUs of higher compute-capability: the compiler could not tell
which memory space was being used by collective operations and was
mistakenly using global ops instead of shared ops.
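
The row-major thread-ranking mentioned in the 1.3.0 interface notes can be written as a one-line formula. The helper below is a host-side sketch; RowMajorRank is a hypothetical name, and in a kernel x, y, and z would come from threadIdx while dim_x and dim_y would be the BLOCK_DIM_X and BLOCK_DIM_Y template parameters.

```cpp
#include <cassert>

// Linear rank of thread (x, y, z) in a multidimensional block, with x varying
// fastest, then y, then z (row-major). dim_z does not appear in the formula.
constexpr int RowMajorRank(int x, int y, int z, int dim_x, int dim_y) {
    return (z * dim_y + y) * dim_x + x;
}

// Example for an 8 x 4 x 2 block: thread (3, 2, 1) has rank (1*4 + 2)*8 + 3.
static_assert(RowMajorRank(3, 2, 1, 8, 4) == 51, "row-major rank mismatch");
```

For a 1D block (y = z = 0) the rank reduces to x, which matches the note above that typical 1D usage needs no changes.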
//-----------------------------------------------------------------------------
1.2.3 04/01/2014
- Bug fixes:
- Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types
- Fixed code-snippet bug in ArgIndexInputIteratorT documentation
//-----------------------------------------------------------------------------
1.2.2 03/03/2014
- New features:
- Added MS VC++ project solutions for device-wide and block-wide examples
- Performance:
- Added a third algorithmic variant of cub::BlockReduce for improved performance
when using commutative operators (e.g., numeric addition)
- Bug fixes:
- Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly
//-----------------------------------------------------------------------------
1.2.0 02/25/2014
- New features:
- Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode)
- Performance:
- Improved DeviceScan, DeviceSelect, DevicePartition performance
- Documentation and testing:
- Compatible with CUDA 6.0
- Added performance-portability plots for many device-wide primitives to doc
- Update doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Bug fixes:
- Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences)
- Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status
- Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block
- Added workaround for TexRefInputIteratorT to work with CUDA 6.0
//-----------------------------------------------------------------------------
1.1.1 12/11/2013
- New features:
- Added TexObjInputIteratorT, TexRefInputIteratorT, CacheModifiedInputIteratorT, and CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. Compatible with Thrust API.
- Added descending sorting to DeviceRadixSort and BlockRadixSort
- Added min, max, arg-min, and arg-max to DeviceReduce
- Added DeviceSelect (select-unique, select-if, and select-flagged)
- Added DevicePartition (partition-if, partition-flagged)
- Added generic cub::ShuffleUp(), cub::ShuffleDown(), and cub::ShuffleIndex() for warp-wide communication of arbitrary data types (SM3x+)
- Added cub::MaxSmOccupancy() for accurately determining SM occupancy for any given kernel function pointer
- Performance:
- Improved DeviceScan and DeviceRadixSort performance for older architectures (SM10-SM30)
- Interface changes:
- Refactored block-wide I/O (BlockLoad and BlockStore), removing cache-modifiers from their interfaces. The CacheModifiedInputIteratorT and CacheModifiedOutputIterator should now be used with BlockLoad and BlockStore to effect that behavior.
- Rename device-wide "stream_synchronous" param to "debug_synchronous" to avoid confusion about usage
- Documentation and testing:
- Added simple examples of device-wide methods
- Improved doxygen documentation and example snippets
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform)
- Bug fixes:
- Fixed misc DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when operating on non-primitive types for older architectures SM10-SM13
- Fixed DeviceScan / WarpReduction bug: SHFL-based segmented reduction producing incorrect results for multi-word types (size > 4B) on Linux
- Fixed BlockScan bug: For warpscan-based scans, not all threads in the first warp were entering the prefix callback functor
- Fixed DeviceRadixSort bug: race condition with key-value pairs for pre-SM35 architectures
- Fixed DeviceRadixSort bug: incorrect bitfield-extract behavior with long keys on 64bit Linux
- Fixed BlockDiscontinuity bug: compilation error for types other than int32/uint32
- CDP (device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts
//-----------------------------------------------------------------------------
1.0.2 08/23/2013
- Corrections to code snippet examples for BlockLoad, BlockStore, and BlockDiscontinuity
- Cleaned up unnecessary/missing header includes. You can now safely #include a specific .cuh (instead of cub.cuh)
- Bug/compilation fixes for BlockHistogram
//-----------------------------------------------------------------------------
1.0.1 08/08/2013
- New collective interface idiom (specialize::construct::invoke).
- Added best-in-class DeviceRadixSort. Implements short-circuiting for homogeneous digit passes.
- Added best-in-class DeviceScan. Implements single-pass "adaptive-lookback" strategy.
- Significantly improved documentation (with example code snippets)
- More extensive regression test suite for aggressively testing collective variants
- Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types)
- Improved support for Kepler SHFL (collective ops now use SHFL for types larger than 32b)
- Better code generation for 64-bit addressing within BlockLoad/BlockStore
- DeviceHistogram now supports histograms of arbitrary bins
- Misc. fixes
- Workarounds for SM10 codegen issues in uncommonly-used WarpScan/Reduce specializations
- Updates to accommodate CUDA 5.5 dynamic parallelism
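
The "specialize::construct::invoke" idiom named in the 1.0.1 notes can be illustrated with a small host-side mock. This is a sketch of the pattern only, assuming the usual three steps (specialize the class template, construct an instance over caller-provided temporary storage, invoke a method). ToyExclusiveSum is a hypothetical class and not a CUB type; a real collective would run cooperatively across the threads of a block with storage in shared memory.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for a collective; not a CUB class.
template <typename T, int NUM_ITEMS>
class ToyExclusiveSum {
public:
    // Scratch storage that the caller allocates and hands to the constructor.
    struct TempStorage { T partials[NUM_ITEMS]; };

    // Construct: bind this instance to caller-provided storage.
    explicit ToyExclusiveSum(TempStorage& storage) : storage_(storage) {}

    // Invoke: exclusive prefix sum of `items`, written back in place.
    void ExclusiveSum(T (&items)[NUM_ITEMS]) {
        T running = T();
        for (int i = 0; i < NUM_ITEMS; ++i) {
            storage_.partials[i] = running;  // sum of all earlier items
            running = running + items[i];
        }
        for (int i = 0; i < NUM_ITEMS; ++i) items[i] = storage_.partials[i];
    }

private:
    TempStorage& storage_;
};

// Usage, one line per step of the idiom:
//   typedef ToyExclusiveSum<int, 4> Scan;   // specialize
//   Scan::TempStorage temp;                 // allocate storage
//   int v[4] = {1, 2, 3, 4};
//   Scan(temp).ExclusiveSum(v);             // construct, then invoke
//   // v is now {0, 1, 3, 6}
```

Separating storage from the instance lets the caller decide where the scratch space lives and reuse it across several collectives.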
//-----------------------------------------------------------------------------
0.9.4 05/07/2013
- Fixed compilation errors for SM10-SM13
- Fixed compilation errors for some WarpScan entrypoints on SM30+
- Added block-wide histogram (BlockHistogram256)
- Added device-wide histogram (DeviceHistogram256)
- Added new BlockScan algorithm variant BLOCK_SCAN_RAKING_MEMOIZE (which
trades more register consumption for less shared memory I/O)
- Updates to BlockRadixRank to use BlockScan (which improves performance
on Kepler due to SHFL instruction)
- Allow types other than C++ primitives to be used in WarpScan::*Sum methods
if they only have operator + overloaded. (Previously they were also required
to support assignment from int(0).)
- Update BlockReduce's BLOCK_REDUCE_WARP_REDUCTIONS algorithm to work even
when block size is not an even multiple of warp size
- Added work management utility descriptors (GridQueue, GridEvenShare)
- Refactoring of DeviceAllocator interface and CachingDeviceAllocator
implementation
- Misc. documentation updates and corrections.
//-----------------------------------------------------------------------------
0.9.2 04/04/2013
- Added WarpReduce. WarpReduce uses the SHFL instruction when applicable.
BlockReduce now uses this WarpReduce instead of implementing its own.
- Misc. fixes for 64-bit Linux compilation warnings and errors.
- Misc. documentation updates and corrections.
//-----------------------------------------------------------------------------
0.9.1 03/09/2013
- Fix for ambiguity in BlockReduce::Reduce() between generic reduction and
summation. Summation entrypoints are now called ::Sum(), similar to the
convention in BlockScan.
- Small edits to mainpage documentation and download tracking
//-----------------------------------------------------------------------------
0.9.0 03/07/2013
- Initial "preview" release. CUB is the first durable, high-performance library
of cooperative block-level, warp-level, and thread-level primitives for CUDA
kernel programming. More primitives and examples coming soon!