Commit
[MetaX] Merge MetaX's modifications to mxmaca/2.6 branch (#68534)
* fix windows bug for common lib (#60308)
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* Update inference_lib.cmake
* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324) Co-authored-by: gouzil <[email protected]>
* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)
* fix weight-only quant kernel error for n div 64 != 0
* code style fix
* tile (#60261)
* add chunk allocator posix_memalign return value check (#60208) (#60495)
* fix chunk allocator posix_memalign return value check; test=develop
* fix chunk allocator posix_memalign return value check; test=develop
* fix chunk allocator posix_memalign return value check; test=develop
* update 2023 security advisory, test=document_fix (#60532)
* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)
* fix fused_rope diff (#60217) (#60593)
* [cherry-pick] fix fleetutil get_online_pass_interval bug3 (#60620)
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* [cherry-pick] update pdsa-2023-019 (#60649)
* update 2023 security advisory, test=document_fix
* update pdsa-2023-019, test=document_fix
* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)
* fix bug of ci (#59926) (#60785)
* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)
* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README
* [Docs] Update latest release version in README (#60691)
* restore order
* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)
* [Cherry-pick] fix set_value with scalar grad (#60930)
* Fix set value grad (#59034)
* first fix the UT
* fix set value grad
* polish code
* add static mode backward test
* always has input valuetensor
* add dygraph test
* Fix shape error in combined-indexing setitem (#60447)
* add ut
* fix shape error in combine-indexing
* fix ut
* Set value with scalar (#60452)
* set_value with scalar
* fix ut
* remove test_pir
* remove one test since 2.6 does not support uint8-add
* [cherry-pick] This PR enables offset of generator for custom device (#60616) (#60772)
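The chunk-allocator bullets above (#60208, #60495) add a missing check on posix_memalign's return value. A minimal sketch of that pattern, using a hypothetical AlignedAlloc wrapper rather than the allocator's real interface:

```cpp
#include <cstdlib>  // posix_memalign
#include <new>      // std::bad_alloc

// Hypothetical wrapper, not Paddle's chunk allocator API. posix_memalign
// reports failure through its return value (it does not set errno), so the
// result must be checked before the pointer is used.
void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  void* ptr = nullptr;
  int rc = posix_memalign(&ptr, alignment, size);
  if (rc != 0) {
    // EINVAL: alignment is not a power-of-two multiple of sizeof(void*).
    // ENOMEM: insufficient memory. ptr is left unmodified on failure.
    throw std::bad_alloc();
  }
  return ptr;
}
```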
* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)
* fix qat tests (#61211) (#61284)
* [Security] fix draw security problem (#61161) (#61338)
* fix draw security problem
* fix _decompress security problem (#61294) (#61337)
* Fix CVE-2024-0521 (#61032) (#61287): uses shlex for safe command parsing to fix arbitrary code injection. Co-authored-by: ndren <[email protected]>
* [Security] fix security problem for prune_by_memory_estimation (#61382)
* OS Command Injection prune_by_memory_estimation fix
* fix code style
* [Security] fix security problem for run_cmd (#61285) (#61398)
* fix security problem for run_cmd
* [Security] fix download security problem (#61162) (#61388)
* fix download security problem
* check eval for security (#61389)
* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045) Co-authored-by: Tian <[email protected]>
* [CherryPick] Fix issue 60092 (#61427)
* fix issue 60092
* update
* update
* update
* Fix unique (#60840) (#61044)
* fix unique kernel, row to num_out
* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)
* remove _wget (#61356) (#61569)
* remove _wget
* remove _wget
* remove wget test
* fix layer_norm decompose dtype bugs, polish codes (#61631)
* fix doc style (#61688)
* merge (#61866)
* [security] refine _get_program_cache_key (#61827) (#61896)
* security, refine _get_program_cache_key
* repeat_interleave support bf16 dtype (#61854) (#61899)
* repeat_interleave support bf16 dtype
* support bf16 on cpu
* Support Fake GroupWise Quant (#61900)
* fix launch when elastic run (#61847) (#61878)
* [Paddle-TRT] fix solve (#61806)
* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)
* fix cachekv quant problem
* add unittest
* Synchronized the paddle 2.4 adaptation changes
* clear third_party dependencies
* change submodules to right commits
* build pass with cpu only
* build success with maca
* build success with cutlass and fused kernels
* build with flash_attn and mccl
* build with test, fix some bugs
* fix some bugs
* fixed some compilation bugs
* fix bug in previous commit
* fix bug with split when col_size bigger than 256
* add row_limit to show full kernel name
* add env.sh Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1
* add shape record Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738
* modify paddle version Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98
* wuzhao optimized the performance of the elementwise kernel. Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c
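Several kernels in this merge lean on vectorized memory access (the elementwise-kernel optimization above, and the vecSize = 8 change further down). A toy illustration of the pattern, assuming suitably aligned inputs; VectorizedAdd and AlignedVector are illustrative names, not the actual Paddle implementation:

```cpp
#include <cuda_runtime.h>

// Each thread handles VecSize contiguous elements through an over-aligned
// vector type so the compiler emits wide (e.g. 128-bit) loads and stores.
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) AlignedVector {
  T val[VecSize];
};

template <typename T, int VecSize>
__global__ void VectorizedAdd(const T* x, const T* y, T* out, int n) {
  using Vec = AlignedVector<T, VecSize>;
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
  if (idx + VecSize <= n) {
    Vec vx = *reinterpret_cast<const Vec*>(x + idx);
    Vec vy = *reinterpret_cast<const Vec*>(y + idx);
    Vec vo;
#pragma unroll
    for (int i = 0; i < VecSize; ++i) vo.val[i] = vx.val[i] + vy.val[i];
    *reinterpret_cast<Vec*>(out + idx) = vo;
  } else {
    for (int i = idx; i < n; ++i) out[i] = x[i] + y[i];  // scalar tail
  }
}
```

The launch covers ceil(n / VecSize) threads; raising VecSize from 4 to 8 halves the thread count and the per-element index arithmetic, which is the spirit of the vecSize = 8 bullet below.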
* fix split when dtype is fp16. Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597
* fix bug in previous commit. Change-Id: I0fa66120160374da5a774ef2c04f133a54517069
* adapt flash_attn new capi. Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba
* change eigen path. Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c
* modify mcname -> replaced_name. Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e
* fix some build bugs. Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d
* add PADDLE_ENABLE_SAME_RAND_A100. Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61
* remove redundant warning, add patch from 2.6.1. Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa
* improve VectorizedBroadcastKernel (cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536) Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c Signed-off-by: m00891 <[email protected]>
* fix bugs (cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c) Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1 Signed-off-by: m00891 <[email protected]>
* split ElementwiseDivGrad (cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f) Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6 Signed-off-by: m00891 <[email protected]>
* in VectorizedElementwiseKernel, it can now use vecSize = 8 (cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5) Change-Id: Ia703b1e9e959558988fcd09182387da839d33922 Signed-off-by: m00891 <[email protected]>
* improve ModulatedDeformableCol2imCoordGpuKernel: 1. block size 512->64; 2. FastDivMod; 3. fix VL1; 4. remove DmcnGetCoordinateWeight divergent branches. (cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb) Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3 Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d_grad compute (InputGrad): 1. use shared memory to optimize data load from global memory; 2. different block sizes for different input shapes; 3. FastDivMod for input shape div, >> and & for stride div. (cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20) Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229 Signed-off-by: m00891 <[email protected]>
* improve VectorizedBroadcastKernel with LoadType = 2 (kMixed) (cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f) Change-Id: I282dd8284a7cde54061780a22b397133303f51e5 Signed-off-by: m00891 <[email protected]>
* fix ElementwiseDivGrad (cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb) Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c Signed-off-by: m00891 <[email protected]>
* Revert "Optimize depthwise_conv2d_grad compute (InputGrad)". This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20. (cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb) Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766 Signed-off-by: m00891 <[email protected]>
* improve ElementwiseDivGrad and ElementwiseMulGrad (cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67) Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c Signed-off-by: m00891 <[email protected]>
* improve FilterBBoxes (cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05) Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45 Signed-off-by: m00891 <[email protected]>
* improve deformable_conv_grad op: 1. adaptive block size; 2. FastDivMod; 3. move ldg up. (cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43) Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1 Signed-off-by: m00891 <[email protected]>
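FastDivMod appears in several bullets here (deformable conv, depthwise conv, and roi_align later in the list). The idea is to replace runtime integer division, which is slow on GPUs, with a multiply-high plus shift against magic numbers precomputed on the host. A sketch of the usual scheme (divisor >= 1, dividends below 2^31), modeled on but not copied from Paddle's helper:

```cpp
#include <cstdint>

// Precomputes a multiplier/shift pair on the host so device code can divide
// by 'd' with one __umulhi plus a shift instead of a hardware divide.
struct FastDivMod {
  explicit FastDivMod(uint32_t d) : divisor(d) {
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;  // smallest shift with 2^shift >= d
    }
    uint64_t one = 1;
    multiplier = static_cast<uint32_t>(
        ((one << 32) * ((one << shift) - d)) / d + 1);
  }
  __device__ __forceinline__ uint32_t Div(uint32_t n) const {
    uint32_t t = __umulhi(n, multiplier);  // high 32 bits of n * multiplier
    return (t + n) >> shift;
  }
  __device__ __forceinline__ uint32_t Mod(uint32_t n) const {
    return n - Div(n) * divisor;
  }
  uint32_t divisor;
  uint32_t multiplier;
  uint32_t shift;
};
```

The struct is built once on the host from a shape or stride, passed by value into the kernel, and used wherever a flat index must be decomposed (Div for the row, Mod for the column).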
* improve ModulatedDeformableIm2colGpuKernel: 1. adaptive block size; 2. FastDivMod; 3. move ldg up. (cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8) Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732 Signed-off-by: m00891 <[email protected]>
* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt (cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451) Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa Signed-off-by: m00891 <[email protected]>
* Improve KeBNBackwardData and FilterGradAddupGpuKernel kernels; improve nonzero and masked_select (forward only) ops. (cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd) Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88 Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d: 1. 256 block size launch for small-shape inputgrad; 2. FastDivMod in inputgrad and filtergrad; 3. shared memory to hold output_grad_data for small shapes. (cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2) Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1 Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors. (cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1) Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3 Signed-off-by: m00891 <[email protected]>
* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors." This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1. (cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37) Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb Signed-off-by: m00891 <[email protected]>
* improve ScatterInitCUDAKernel and ScatterCUDAKernel (cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24) Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587 Signed-off-by: m00891 <[email protected]>
* fix bugs and make the code easier to read (cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836) Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f Signed-off-by: m00891 <[email protected]>
* Optimize FilterGrad and InputGradSpL: use tmp registers to hold ldg data in the loop so that computation and ldg latency can overlap. (cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e) Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6 Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by moving address access into shared memory and making each thread do more work. (cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978) Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e Signed-off-by: m00891 <[email protected]>
* Optimize SwinTransformer: 1. LayerNormBackward: remove the if statement so the compiler always loops VPT times for ldg128, with a bool flag controlling whether the write is taken; 2. ContiguousCaseOneFunc: temporarily save division results to do fewer divisions. (cherry picked from commit 422d676507308d26f6107bed924424166aa350d3) Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429 Signed-off-by: m00891 <[email protected]>
* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize: set blockDim.z so the block size is always 512 and each block handles several batches; all threads then loop 4 times for better performance. (cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b) Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148 Signed-off-by: m00891 <[email protected]>
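The KeBNBackwardData bullet above swaps 1.0/sqrt for rsqrt: a divide plus a square root becomes a single reciprocal-square-root operation. An illustrative kernel (names and layout are made up, not the real batch-norm backward):

```cpp
#include <cuda_runtime.h>

// Scales an upstream gradient by the per-channel inverse standard deviation.
__global__ void ScaleByInvStd(const float* dy, const float* variance,
                              float eps, int c, int n, float* dx) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // before: float inv_std = 1.0f / sqrtf(variance[i % c] + eps);
    float inv_std = rsqrtf(variance[i % c] + eps);  // single rsqrt operation
    dx[i] = dy[i] * inv_std;
  }
}
```

rsqrtf is the fast approximate intrinsic-backed form; its precision differs slightly from 1.0f / sqrtf, which is generally acceptable in a gradient computation like this.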
* improve KeMatrixTopK: 1. fix private memory; 2. modify max grid size; 3. change it to a 64-lane warp reduce. (cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7) Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde Signed-off-by: m00891 <[email protected]>
* Modify LayerNorm optimization; might have a loss diff with the old optimization without atomicAdd. (cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc) Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637 Signed-off-by: m00891 <[email protected]>
* improve roi_align op: 1. adaptive block size; 2. FastDivMod. (cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71) Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe Signed-off-by: m00891 <[email protected]>
* add workaround for parameter dislocation when calling BatchedGEMM<float16>. Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927
* fix McFlashAttn string. Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34
* [C500-27046] fix wb issue. Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7
* Support compiling external ops. Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a
* support flash attn varlen api and support arm build. Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673
* Add a copyright notice. Change-Id: I8ece364d926596a40f42d973190525d9b8224d99
* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <[email protected]>
Co-authored-by: risemeup1 <[email protected]>
Co-authored-by: Nyakku Shigure <[email protected]>
Co-authored-by: gouzil <[email protected]>
Co-authored-by: Wang Bojun <[email protected]>
Co-authored-by: lizexu123 <[email protected]>
Co-authored-by: danleifeng <[email protected]>
Co-authored-by: Vigi Zhang <[email protected]>
Co-authored-by: tianhaodongbd <[email protected]>
Co-authored-by: zyfncg <[email protected]>
Co-authored-by: JYChen <[email protected]>
Co-authored-by: zhaohaixu <[email protected]>
Co-authored-by: Spelling <[email protected]>
Co-authored-by: zhouzj <[email protected]>
Co-authored-by: wanghuancoder <[email protected]>
Co-authored-by: ndren <[email protected]>
Co-authored-by: Nguyen Cong Vinh <[email protected]>
Co-authored-by: Ruibin Cheung <[email protected]>
Co-authored-by: Tian <[email protected]>
Co-authored-by: Yuanle Liu <[email protected]>
Co-authored-by: zhuyipin <[email protected]>
Co-authored-by: 6clc <[email protected]>
Co-authored-by: Wenyu <[email protected]>
Co-authored-by: Xianduo Li <[email protected]>
Co-authored-by: Wang Xin <[email protected]>
Co-authored-by: Chang Xu <[email protected]>
Co-authored-by: wentao yu <[email protected]>
Co-authored-by: zhink <[email protected]>
Co-authored-by: handiz <[email protected]>
Co-authored-by: zhimin Pan <[email protected]>
Co-authored-by: m00891 <[email protected]>
Co-authored-by: shuliu <[email protected]>
Co-authored-by: Yanxin Zhou <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
Co-authored-by: m00932 <[email protected]>
Co-authored-by: Fangzhou Feng <[email protected]>
Co-authored-by: junwang <[email protected]>
Co-authored-by: m01097 <[email protected]>