While cublasLtMatmul is the general entry point, grouped execution typically depends on passing a cublasLtMatmulDesc_t configured with grouped attributes, or on helper functions where the backend wrapper provides them.
October 26, 2023
Subject: Documentation and Usage of Grouped GEMM in NVIDIA cuBLASLt
📖 NVIDIA cuBLASLt Developer Guide → Grouped GEMM section
Unlike a standard batched GEMM, where every problem in the batch shares the same M, N, and K, each operation in a group can have its own dimensions.
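To make the difference from a batched GEMM concrete, here is a minimal CPU reference sketch of what a grouped GEMM computes. It is not cuBLASLt code: the GemmProblem struct and the groupedGemmReference function are hypothetical names introduced for illustration, and row-major storage is an assumption. A reference like this is mainly useful for checking GPU results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration type (not a cuBLASLt type).
// One problem in the group: C (m x n) = A (m x k) * B (k x n), row-major.
struct GemmProblem {
    int m, n, k;
    std::vector<float> a;  // m * k elements
    std::vector<float> b;  // k * n elements
    std::vector<float> c;  // m * n elements, overwritten with the result
};

// Naive CPU reference: every problem in the group carries its own m, n, k,
// so shapes may differ from one problem to the next.
void groupedGemmReference(std::vector<GemmProblem>& group) {
    for (GemmProblem& p : group) {
        assert(p.a.size() == static_cast<std::size_t>(p.m) * p.k);
        assert(p.b.size() == static_cast<std::size_t>(p.k) * p.n);
        p.c.assign(static_cast<std::size_t>(p.m) * p.n, 0.0f);
        for (int i = 0; i < p.m; ++i) {
            for (int j = 0; j < p.n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < p.k; ++l) {
                    acc += p.a[i * p.k + l] * p.b[l * p.n + j];
                }
                p.c[i * p.n + j] = acc;
            }
        }
    }
}
```

A batched GEMM would correspond to the special case in which every entry of the group has identical m, n, and k.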
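Grouped GEMM kernels often need scratch space, either shared memory or a global-memory workspace, to coordinate thread blocks. Note that cublasLtMatmulPreferenceSetAttribute does not allocate anything: setting CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES to a sufficient limit (e.g., 32 MB) tells the heuristic how much workspace it may assume, which allows it to select high-performance split-K or batched epilogue kernels. The buffer itself is allocated separately (for example with cudaMalloc) and passed to cublasLtMatmul.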
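Use cublasLtMatmulDesc_t to define the math (compute type, scale type, transposes, epilogue) and cublasLtMatrixLayout_t to define the data layout (element type, rows, columns, leading dimension) of each matrix.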