Commit
[MetaX] Merge MetaX's modifications to mxmaca/2.6 branch (#68534)
* fix windows bug for common lib (#60308)
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* Update inference_lib.cmake
* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324) Co-authored-by: gouzil <[email protected]>
* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)
* fix weight-only quant kernel error for n div 64 != 0
* code style fix
* tile (#60261)
* add chunk allocator posix_memalign return value check (#60208) (#60495)
* fix chunk allocator posix_memalign return value check; test=develop
* fix chunk allocator posix_memalign return value check; test=develop
* fix chunk allocator posix_memalign return value check; test=develop
* update 2023 security advisory, test=document_fix (#60532)
* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)
* fix fused_rope diff (#60217) (#60593)
* [cherry-pick] fix fleetutil get_online_pass_interval bug3 (#60620)
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* [cherry-pick] update pdsa-2023-019 (#60649)
* update 2023 security advisory, test=document_fix
* update pdsa-2023-019, test=document_fix
* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)
* fix bug of ci (#59926) (#60785)
* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)
* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README
* [Docs] Update latest release version in README (#60691)
* restore order
* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)
* [Cherry-pick] fix set_value with scalar grad (#60930)
* Fix set value grad (#59034)
* first fix the UT
* fix set value grad
* polish code
* add static mode backward test
* always has input valuetensor
* add dygraph test
* Fix shape error in combined-indexing setitem (#60447)
* add ut
* fix shape error in combine-indexing
* fix ut
* Set value with scalar (#60452)
* set_value with scalar
* fix ut
* remove test_pir
* remove one test since 2.6 does not support uint8-add
* [cherry-pick] This PR enables offset of generator for custom device (#60616) (#60772)
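The chunk-allocator bullets above (#60208, #60495) add a missing check on posix_memalign's return value. A minimal sketch of that pattern, using a hypothetical AlignedAlloc wrapper rather than the allocator's real interface:

```cpp
#include <cstdlib>  // posix_memalign
#include <new>      // std::bad_alloc

// Hypothetical wrapper, not Paddle's chunk allocator API. posix_memalign
// reports failure through its return value (it does not set errno), so the
// result must be checked before the pointer is used.
void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  void* ptr = nullptr;
  int rc = posix_memalign(&ptr, alignment, size);
  if (rc != 0) {
    // EINVAL: alignment is not a power-of-two multiple of sizeof(void*).
    // ENOMEM: insufficient memory. ptr is left unmodified on failure.
    throw std::bad_alloc();
  }
  return ptr;
}
```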
* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)
* fix qat tests (#61211) (#61284)
* [Security] fix draw security problem (#61161) (#61338)
* fix draw security problem
* fix _decompress security problem (#61294) (#61337)
* Fix CVE-2024-0521 (#61032) (#61287): uses shlex for safe command parsing to fix arbitrary code injection. Co-authored-by: ndren <[email protected]>
* [Security] fix security problem for prune_by_memory_estimation (#61382)
* OS Command Injection prune_by_memory_estimation fix
* fix code style
* [Security] fix security problem for run_cmd (#61285) (#61398)
* fix security problem for run_cmd
* [Security] fix download security problem (#61162) (#61388)
* fix download security problem
* check eval for security (#61389)
* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045) Co-authored-by: Tian <[email protected]>
* [CherryPick] Fix issue 60092 (#61427)
* fix issue 60092
* update
* update
* update
* Fix unique (#60840) (#61044)
* fix unique kernel, row to num_out
* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)
* remove _wget (#61356) (#61569)
* remove _wget
* remove _wget
* remove wget test
* fix layer_norm decompose dtype bugs, polish codes (#61631)
* fix doc style (#61688)
* merge (#61866)
* [security] refine _get_program_cache_key (#61827) (#61896)
* security, refine _get_program_cache_key
* repeat_interleave support bf16 dtype (#61854) (#61899)
* repeat_interleave support bf16 dtype
* support bf16 on cpu
* Support Fake GroupWise Quant (#61900)
* fix launch when elastic run (#61847) (#61878)
* [Paddle-TRT] fix solve (#61806)
* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)
* fix cachekv quant problem
* add unittest
* Synchronized the paddle 2.4 adaptation changes
* clear third_party dependencies
* change submodules to right commits
* build pass with cpu only
* build success with maca
* build success with cutlass and fused kernels
* build with flash_attn and mccl
* build with test, fix some bugs
* fix some bugs
* fixed some compilation bugs
* fix bug in previous commit
* fix bug with split when col_size bigger than 256
* add row_limit to show full kernel name
* add env.sh Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1
* add shape record Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738
* modify paddle version Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98
* wuzhao optimized the performance of the elementwise kernel. Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c
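Several kernels in this merge lean on vectorized memory access (the elementwise-kernel optimization above, and the vecSize = 8 change further down). A toy illustration of the pattern, assuming suitably aligned inputs; VectorizedAdd and AlignedVector are illustrative names, not the actual Paddle implementation:

```cpp
#include <cuda_runtime.h>

// Each thread handles VecSize contiguous elements through an over-aligned
// vector type so the compiler emits wide (e.g. 128-bit) loads and stores.
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) AlignedVector {
  T val[VecSize];
};

template <typename T, int VecSize>
__global__ void VectorizedAdd(const T* x, const T* y, T* out, int n) {
  using Vec = AlignedVector<T, VecSize>;
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
  if (idx + VecSize <= n) {
    Vec vx = *reinterpret_cast<const Vec*>(x + idx);
    Vec vy = *reinterpret_cast<const Vec*>(y + idx);
    Vec vo;
#pragma unroll
    for (int i = 0; i < VecSize; ++i) vo.val[i] = vx.val[i] + vy.val[i];
    *reinterpret_cast<Vec*>(out + idx) = vo;
  } else {
    for (int i = idx; i < n; ++i) out[i] = x[i] + y[i];  // scalar tail
  }
}
```

The launch covers ceil(n / VecSize) threads; raising VecSize from 4 to 8 halves the thread count and the per-element index arithmetic, which is the spirit of the vecSize = 8 bullet below.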
* fix split when dtype is fp16. Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597
* fix bug in previous commit. Change-Id: I0fa66120160374da5a774ef2c04f133a54517069
* adapt flash_attn new capi. Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba
* change eigen path. Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c
* modify mcname -> replaced_name. Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e
* fix some build bugs. Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d
* add PADDLE_ENABLE_SAME_RAND_A100. Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61
* remove redundant warning, add patch from 2.6.1. Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa
* improve VectorizedBroadcastKernel (cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536) Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c Signed-off-by: m00891 <[email protected]>
* fix bugs (cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c) Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1 Signed-off-by: m00891 <[email protected]>
* split ElementwiseDivGrad (cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f) Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6 Signed-off-by: m00891 <[email protected]>
* in VectorizedElementwiseKernel, it can now use vecSize = 8 (cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5) Change-Id: Ia703b1e9e959558988fcd09182387da839d33922 Signed-off-by: m00891 <[email protected]>
* improve ModulatedDeformableCol2imCoordGpuKernel: 1. block size 512->64; 2. FastDivMod; 3. fix VL1; 4. remove DmcnGetCoordinateWeight divergent branches. (cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb) Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3 Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d_grad compute (InputGrad): 1. use shared memory to optimize data load from global memory; 2. different block sizes for different input shapes; 3. FastDivMod for input shape div, >> and & for stride div. (cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20) Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229 Signed-off-by: m00891 <[email protected]>
* improve VectorizedBroadcastKernel with LoadType = 2 (kMixed) (cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f) Change-Id: I282dd8284a7cde54061780a22b397133303f51e5 Signed-off-by: m00891 <[email protected]>
* fix ElementwiseDivGrad (cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb) Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c Signed-off-by: m00891 <[email protected]>
* Revert "Optimize depthwise_conv2d_grad compute (InputGrad)". This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20. (cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb) Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766 Signed-off-by: m00891 <[email protected]>
* improve ElementwiseDivGrad and ElementwiseMulGrad (cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67) Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c Signed-off-by: m00891 <[email protected]>
* improve FilterBBoxes (cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05) Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45 Signed-off-by: m00891 <[email protected]>
* improve deformable_conv_grad op: 1. adaptive block size; 2. FastDivMod; 3. move ldg up. (cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43) Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1 Signed-off-by: m00891 <[email protected]>
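FastDivMod appears in several bullets here (deformable conv, depthwise conv, and roi_align later in the list). The idea is to replace runtime integer division, which is slow on GPUs, with a multiply-high plus shift against magic numbers precomputed on the host. A sketch of the usual scheme (divisor >= 1, dividends below 2^31), modeled on but not copied from Paddle's helper:

```cpp
#include <cstdint>

// Precomputes a multiplier/shift pair on the host so device code can divide
// by 'd' with one __umulhi plus a shift instead of a hardware divide.
struct FastDivMod {
  explicit FastDivMod(uint32_t d) : divisor(d) {
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;  // smallest shift with 2^shift >= d
    }
    uint64_t one = 1;
    multiplier = static_cast<uint32_t>(
        ((one << 32) * ((one << shift) - d)) / d + 1);
  }
  __device__ __forceinline__ uint32_t Div(uint32_t n) const {
    uint32_t t = __umulhi(n, multiplier);  // high 32 bits of n * multiplier
    return (t + n) >> shift;
  }
  __device__ __forceinline__ uint32_t Mod(uint32_t n) const {
    return n - Div(n) * divisor;
  }
  uint32_t divisor;
  uint32_t multiplier;
  uint32_t shift;
};
```

The struct is built once on the host from a shape or stride, passed by value into the kernel, and used wherever a flat index must be decomposed (Div for the row, Mod for the column).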
* improve ModulatedDeformableIm2colGpuKernel: 1. adaptive block size; 2. FastDivMod; 3. move ldg up. (cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8) Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732 Signed-off-by: m00891 <[email protected]>
* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt (cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451) Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa Signed-off-by: m00891 <[email protected]>
* Improve KeBNBackwardData and FilterGradAddupGpuKernel kernels; improve nonzero and masked_select (forward only) ops. (cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd) Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88 Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d: 1. 256 block size launch for small-shape inputgrad; 2. FastDivMod in inputgrad and filtergrad; 3. shared memory to hold output_grad_data for small shapes. (cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2) Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1 Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors. (cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1) Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3 Signed-off-by: m00891 <[email protected]>
* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors." This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1. (cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37) Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb Signed-off-by: m00891 <[email protected]>
* improve ScatterInitCUDAKernel and ScatterCUDAKernel (cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24) Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587 Signed-off-by: m00891 <[email protected]>
* fix bugs and make the code easier to read (cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836) Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f Signed-off-by: m00891 <[email protected]>
* Optimize FilterGrad and InputGradSpL: use tmp registers to hold ldg data in the loop so that computation and ldg latency can overlap. (cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e) Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6 Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by moving address access into shared memory and making each thread do more work. (cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978) Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e Signed-off-by: m00891 <[email protected]>
* Optimize SwinTransformer: 1. LayerNormBackward: remove the if statement so the compiler always loops VPT times for ldg128, with a bool flag controlling whether the write is taken; 2. ContiguousCaseOneFunc: temporarily save division results to do fewer divisions. (cherry picked from commit 422d676507308d26f6107bed924424166aa350d3) Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429 Signed-off-by: m00891 <[email protected]>
* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize: set blockDim.z so the block size is always 512 and each block handles several batches; all threads then loop 4 times for better performance. (cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b) Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148 Signed-off-by: m00891 <[email protected]>
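The KeBNBackwardData bullet above swaps 1.0/sqrt for rsqrt: a divide plus a square root becomes a single reciprocal-square-root operation. An illustrative kernel (names and layout are made up, not the real batch-norm backward):

```cpp
#include <cuda_runtime.h>

// Scales an upstream gradient by the per-channel inverse standard deviation.
__global__ void ScaleByInvStd(const float* dy, const float* variance,
                              float eps, int c, int n, float* dx) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // before: float inv_std = 1.0f / sqrtf(variance[i % c] + eps);
    float inv_std = rsqrtf(variance[i % c] + eps);  // single rsqrt operation
    dx[i] = dy[i] * inv_std;
  }
}
```

rsqrtf is the fast approximate intrinsic-backed form; its precision differs slightly from 1.0f / sqrtf, which is generally acceptable in a gradient computation like this.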
* improve KeMatrixTopK: 1. fix private memory; 2. modify max grid size; 3. change it to a 64-lane warp reduce. (cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7) Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde Signed-off-by: m00891 <[email protected]>
* Modify LayerNorm optimization; might have a loss diff with the old optimization without atomicAdd. (cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc) Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637 Signed-off-by: m00891 <[email protected]>
* improve roi_align op: 1. adaptive block size; 2. FastDivMod. (cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71) Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe Signed-off-by: m00891 <[email protected]>
* add workaround for parameter dislocation when calling BatchedGEMM<float16>. Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927
* fix McFlashAttn string. Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34
* [C500-27046] fix wb issue. Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7
* Support compiling external ops. Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a
* support flash attn varlen api and support arm build. Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673
* Add a copyright notice. Change-Id: I8ece364d926596a40f42d973190525d9b8224d99
* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <[email protected]>
Co-authored-by: risemeup1 <[email protected]>
Co-authored-by: Nyakku Shigure <[email protected]>
Co-authored-by: gouzil <[email protected]>
Co-authored-by: Wang Bojun <[email protected]>
Co-authored-by: lizexu123 <[email protected]>
Co-authored-by: danleifeng <[email protected]>
Co-authored-by: Vigi Zhang <[email protected]>
Co-authored-by: tianhaodongbd <[email protected]>
Co-authored-by: zyfncg <[email protected]>
Co-authored-by: JYChen <[email protected]>
Co-authored-by: zhaohaixu <[email protected]>
Co-authored-by: Spelling <[email protected]>
Co-authored-by: zhouzj <[email protected]>
Co-authored-by: wanghuancoder <[email protected]>
Co-authored-by: ndren <[email protected]>
Co-authored-by: Nguyen Cong Vinh <[email protected]>
Co-authored-by: Ruibin Cheung <[email protected]>
Co-authored-by: Tian <[email protected]>
Co-authored-by: Yuanle Liu <[email protected]>
Co-authored-by: zhuyipin <[email protected]>
Co-authored-by: 6clc <[email protected]>
Co-authored-by: Wenyu <[email protected]>
Co-authored-by: Xianduo Li <[email protected]>
Co-authored-by: Wang Xin <[email protected]>
Co-authored-by: Chang Xu <[email protected]>
Co-authored-by: wentao yu <[email protected]>
Co-authored-by: zhink <[email protected]>
Co-authored-by: handiz <[email protected]>
Co-authored-by: zhimin Pan <[email protected]>
Co-authored-by: m00891 <[email protected]>
Co-authored-by: shuliu <[email protected]>
Co-authored-by: Yanxin Zhou <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
Co-authored-by: m00932 <[email protected]>
Co-authored-by: Fangzhou Feng <[email protected]>
Co-authored-by: junwang <[email protected]>
Co-authored-by: m01097 <[email protected]>