cuBLASLt Grouped GEMM Documentation Jun 2026

While cublasLtMatmul is the general entry point, grouped execution is typically requested by passing a cublasLtMatmulDesc_t configured with grouped attributes, or by calling helper functions where the backend wrapper provides them.

October 26, 2023
Subject: Documentation and Usage of Grouped GEMM in NVIDIA cuBLASLt

📖 NVIDIA cuBLASLt Developer Guide → Grouped GEMM section

Unlike standard batched GEMMs, where every operation shares the same dimensions, each operation in a group can have its own dimensions.
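To make the "own dimensions per operation" point concrete, here is a minimal CPU reference sketch in C++. It is not cuBLASLt code: the `GemmProblem` struct and the `grouped_gemm_reference` function are hypothetical names introduced only for illustration, and row-major float storage is an assumption. It simply computes C_i = A_i * B_i for each problem in the group, which is also a handy way to validate GPU results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one problem in a group (not a cuBLASLt type).
// A is m x k, B is k x n, C is m x n, all stored row-major.
struct GemmProblem {
    int m, n, k;
    std::vector<float> a, b, c;
};

// Reference grouped GEMM: computes C_i = A_i * B_i for every problem.
// Each problem carries its own m, n, k, unlike a uniform batched GEMM.
void grouped_gemm_reference(std::vector<GemmProblem>& group) {
    for (GemmProblem& p : group) {
        // Check that the buffers match the declared dimensions.
        assert(p.a.size() == static_cast<std::size_t>(p.m) * p.k);
        assert(p.b.size() == static_cast<std::size_t>(p.k) * p.n);
        p.c.assign(static_cast<std::size_t>(p.m) * p.n, 0.0f);
        for (int i = 0; i < p.m; ++i) {
            for (int l = 0; l < p.k; ++l) {
                const float a_il = p.a[i * p.k + l];
                for (int j = 0; j < p.n; ++j) {
                    p.c[i * p.n + j] += a_il * p.b[l * p.n + j];
                }
            }
        }
    }
}
```

A group might, for example, mix a 2x2 output with inner dimension 1 and a 1x1 output with inner dimension 3; a uniform batched GEMM could not express that without padding.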


Grouped GEMM kernels often require shared memory or a global-memory workspace to coordinate work across thread blocks. Allocate a sufficiently large workspace (e.g., 32 MB) and report its size to the heuristic by setting CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES via cublasLtMatmulPreferenceSetAttribute; this allows the heuristic to select high-performance split-K or batched epilogue kernels.

Use cublasLtMatmulDesc_t and cublasLtMatrixLayout_t to define the math operation and the data layout for your matrices.