Idiomatic Rust wrappers for the NVIDIA CUDA stack (Driver API, Runtime API, NVRTC, cuBLAS, cuDNN, NCCL, NVML, ...). Umbrella crate.
Build-time CUDA kernel compiler for the baracuda ecosystem: nvcc-driven incremental builds, parallel compilation, GPU auto-detection, and CUTLASS / custom git dependency support.
Build + raw FFI bindings to baracuda's clean fork of Hiroyuki Ootomo's ozIMMU, the Ozaki-scheme FP64 GEMM library that synthesizes a DGEMM from S² int8 tensor-core matmuls. Phase 44b internalized the upstream sources under `cuda/` (no more `vendor/` subdir; cutf submodule eliminated). Linked statically into the baracuda CUDA stack; consumed by the safe wrapper crate `baracuda-ozimmu`. MIT-licensed (original ozIMMU is MIT; see `ATTRIBUTION.md`).
Shared type vocabulary for the baracuda ML kernel facade: Element / IntElement / FpElement / BiasElement trait hierarchy, layout / epilogue / activation tags, MatrixRef / TensorRef views, PlanPreference, PrecisionGuarantee, and Workspace. Lifted from baracuda-cutlass so that baracuda-kernels and the per-library wrapper crates can share one vocabulary.
Build + raw FFI bindings to baracuda's port of NVIDIA TransformerEngine's FP8 cast/transpose + delayed-scaling recipe primitives. Cast/recipe subset only: `normalization` / `fused_rope` / `fused_attn` / `fused_softmax` / `activation` / `gemm` are deliberately skipped (they overlap existing baracuda Phases 3/5/14/17/30/31/36/41/42). NO cuDNN dep (the recipe + cast paths don't need it; `fused_attn` would, and we skip it); NO pybind11 (this crate exposes a raw C ABI defined in `csrc/baracuda_te_shim.cu`; the safe wrapper lives in `baracuda-transformer-engine`). Apache-2.0 per upstream; see `ATTRIBUTION.md`.
Megatron-LM-style tensor-parallel primitives (Column / Row Parallel Linear) for the baracuda CUDA stack. Pure-composition crate: local GEMM via baracuda-cublas + cross-rank collectives via baracuda-nccl. No new CUDA kernels. NEW in Phase 57; deliberate scope expansion (distributed-training-framework-adjacent). Off by default in baracuda-kernels, behind the `megatron_tp` cargo feature, so non-distributed consumers don't pay the dependency-surface cost. Algorithmic reference: Shoeybi et al., arXiv:1909.08053 (NVIDIA Megatron-LM, Apache-2.0).
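The column / row split behind these primitives can be illustrated on the CPU. The sketch below is not the crate's API: `Mat`, `column_parallel`, and `row_parallel` are hypothetical names, ranks are simulated with a loop, and the concatenate and sum steps stand in for the all-gather and all-reduce collectives that a real implementation would presumably issue through NCCL.

```rust
/// Row-major dense matrix; a CPU stand-in for a device buffer (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct Mat {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f64>,
}

impl Mat {
    pub fn new(rows: usize, cols: usize, data: Vec<f64>) -> Self {
        assert_eq!(data.len(), rows * cols);
        Mat { rows, cols, data }
    }

    /// Plain triple-loop GEMM: self (m x k) times rhs (k x n).
    pub fn matmul(&self, rhs: &Mat) -> Mat {
        assert_eq!(self.cols, rhs.rows);
        let mut out = vec![0.0; self.rows * rhs.cols];
        for i in 0..self.rows {
            for p in 0..self.cols {
                let a = self.data[i * self.cols + p];
                for j in 0..rhs.cols {
                    out[i * rhs.cols + j] += a * rhs.data[p * rhs.cols + j];
                }
            }
        }
        Mat::new(self.rows, rhs.cols, out)
    }

    /// Copy of columns [c0, c1).
    pub fn col_slice(&self, c0: usize, c1: usize) -> Mat {
        let mut d = Vec::with_capacity(self.rows * (c1 - c0));
        for i in 0..self.rows {
            d.extend_from_slice(&self.data[i * self.cols + c0..i * self.cols + c1]);
        }
        Mat::new(self.rows, c1 - c0, d)
    }

    /// Copy of rows [r0, r1).
    pub fn row_slice(&self, r0: usize, r1: usize) -> Mat {
        Mat::new(r1 - r0, self.cols, self.data[r0 * self.cols..r1 * self.cols].to_vec())
    }
}

/// Column-parallel linear: rank r owns a block of W's columns and computes
/// X * W_r; the per-rank outputs are concatenated column-wise (all-gather).
pub fn column_parallel(x: &Mat, w: &Mat, ranks: usize) -> Mat {
    assert_eq!(w.cols % ranks, 0);
    let shard = w.cols / ranks;
    let mut out = vec![0.0; x.rows * w.cols];
    for r in 0..ranks {
        let part = x.matmul(&w.col_slice(r * shard, (r + 1) * shard));
        for i in 0..x.rows {
            for j in 0..shard {
                out[i * w.cols + r * shard + j] = part.data[i * shard + j];
            }
        }
    }
    Mat::new(x.rows, w.cols, out)
}

/// Row-parallel linear: rank r owns a block of W's rows and the matching
/// block of X's columns; the partial products are summed (all-reduce).
pub fn row_parallel(x: &Mat, w: &Mat, ranks: usize) -> Mat {
    assert_eq!(w.rows % ranks, 0);
    let shard = w.rows / ranks;
    let mut out = vec![0.0; x.rows * w.cols];
    for r in 0..ranks {
        let partial = x
            .col_slice(r * shard, (r + 1) * shard)
            .matmul(&w.row_slice(r * shard, (r + 1) * shard));
        for (o, p) in out.iter_mut().zip(&partial.data) {
            *o += *p;
        }
    }
    Mat::new(x.rows, w.cols, out)
}
```

Both functions should reproduce `x.matmul(&w)` for any rank count that divides the sharded dimension. Chaining a column-parallel layer into a row-parallel one is the pairing the Megatron-LM paper uses to keep the intermediate activation sharded between the two GEMMs.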
Safe Rust wrapper for baracuda's port of NVIDIA TransformerEngine's FP8 cast/transpose + delayed-scaling recipe primitives. Provides `Fp8Recipe` (delayed-scaling state with amax history), `Fp8CastPlan` for {f32, f16, bf16} → FP8 with running amax, and `Fp8DequantPlan` for FP8 → {f32, f16, bf16}. Cast/recipe subset only: `normalization` / `fused_rope` / `fused_attn` / `fused_softmax` / `activation` / `gemm` are skipped (they overlap existing baracuda phases). NO cuDNN dep, NO pybind11. On Ada (sm_89) the FP8 wins are bandwidth-saving only (KV cache, weights); FP8 tensor-core math throughput equals BF16. Forward-compatible with Hopper / Blackwell, where the compute wins also materialize.
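As a rough illustration of what "delayed scaling with amax history" means, here is a minimal CPU sketch. It is not the crate's `Fp8Recipe`: the `DelayedScaling` type, its methods, and the zero-margin scale formula (`E4M3_MAX / max(history)`) are assumptions made for the example, and the cast only scales and clamps without rounding to an FP8 bit pattern.

```rust
use std::collections::VecDeque;

/// Largest finite magnitude representable in FP8 E4M3.
pub const E4M3_MAX: f32 = 448.0;

/// Hypothetical delayed-scaling state (illustrative, not the crate's type).
pub struct DelayedScaling {
    history: VecDeque<f32>, // most recent amax values, oldest first
    capacity: usize,        // length of the amax window
    scale: f32,             // scale derived from earlier steps
}

impl DelayedScaling {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        DelayedScaling {
            history: VecDeque::with_capacity(capacity),
            capacity,
            scale: 1.0,
        }
    }

    pub fn scale(&self) -> f32 {
        self.scale
    }

    /// Scale `x` with the scale computed from *previous* steps (hence
    /// "delayed"), clamp to the E4M3 range, then record this step's amax
    /// and refresh the scale for the next call.
    pub fn cast_step(&mut self, x: &[f32]) -> Vec<f32> {
        let scale = self.scale;
        let scaled: Vec<f32> = x
            .iter()
            .map(|v| (v * scale).clamp(-E4M3_MAX, E4M3_MAX))
            .collect();

        // Running amax of the unscaled input.
        let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        if self.history.len() == self.capacity {
            self.history.pop_front();
        }
        self.history.push_back(amax);

        // Assumed update rule: map the window's largest amax onto E4M3_MAX.
        let hist_max = self.history.iter().fold(0.0f32, |m, v| m.max(*v));
        if hist_max > 0.0 {
            self.scale = E4M3_MAX / hist_max;
        }
        scaled
    }
}
```

The point of the delay is that the cast never has to wait for a reduction over the tensor it is about to quantize; the cost is that a sudden spike saturates for one step, which the clamp makes visible in this sketch.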
Safe Rust wrapper for baracuda's clean fork of Hiroyuki Ootomo's ozIMMU, an Ozaki-scheme FP64 GEMM that synthesizes a DGEMM from S² int8 tensor-core matmuls. Provides an RAII handle + drop-in `dgemm` shim suitable for the `BackendKind::Ozaki` path of `baracuda-kernels`'s `GemmPlan`. Opt-in (NOT bit-equivalent to native DGEMM); the default FP64 path stays on CUTLASS / cuBLAS.
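To make the "S² int8 matmuls" idea concrete, the sketch below applies an Ozaki-style split to a dot product, which is the 1 x 1 case of a GEMM. It is a simplified illustration under stated assumptions, not ozIMMU's actual splitting: `split_int8`, `ozaki_dot`, and the 6-bit slice width are all choices made for this example. Each operand is scaled by a shared power of two, cut into S signed int8 digit vectors, and every pair of slices is multiplied in exact integer arithmetic before the S² partial results are recombined in FP64.

```rust
/// Bits carried by each int8 slice; 6 leaves headroom inside i8's range.
const SLICE_BITS: i32 = 6;

/// Split `x` into `slices` int8 digit vectors plus a shared exponent, so that
/// x[k] ~= 2^exp * sum_s digits[s][k] * 2^(-SLICE_BITS * (s + 1)).
pub fn split_int8(x: &[f64], slices: usize) -> (Vec<Vec<i8>>, i32) {
    let amax = x.iter().fold(0.0f64, |m, v| m.max(v.abs()));
    if amax == 0.0 {
        return (vec![vec![0i8; x.len()]; slices], 0);
    }
    // Shared exponent chosen so every scaled element has magnitude below 1.
    let exp = amax.log2().floor() as i32 + 1;
    let inv = 2f64.powi(-exp);
    let radix = 2f64.powi(SLICE_BITS);
    let mut residual: Vec<f64> = x.iter().map(|v| v * inv).collect();
    let mut digits = Vec::with_capacity(slices);
    for _ in 0..slices {
        let mut slice = Vec::with_capacity(x.len());
        for r in residual.iter_mut() {
            *r *= radix;        // shift the next SLICE_BITS bits above the point
            let d = r.trunc();  // signed digit, magnitude at most 2^SLICE_BITS
            slice.push(d as i8);
            *r -= d;            // keep the remaining fraction for later slices
        }
        digits.push(slice);
    }
    (digits, exp)
}

/// Dot product assembled from slices * slices exact integer dot products.
pub fn ozaki_dot(a: &[f64], b: &[f64], slices: usize) -> f64 {
    assert_eq!(a.len(), b.len());
    let (da, ea) = split_int8(a, slices);
    let (db, eb) = split_int8(b, slices);
    let mut sum = 0.0f64;
    for (i, ai) in da.iter().enumerate() {
        for (j, bj) in db.iter().enumerate() {
            // int8 x int8 products accumulated in i32: the part a tensor-core
            // int8 matmul would perform.
            let acc: i32 = ai.iter().zip(bj).map(|(&p, &q)| p as i32 * q as i32).sum();
            let weight = 2f64.powi(-SLICE_BITS * (i + j + 2) as i32);
            sum += acc as f64 * weight;
        }
    }
    sum * 2f64.powi(ea + eb)
}
```

More slices recover more mantissa bits at quadratic cost, and the FP64 recombination order differs from a native DGEMM's accumulation, which is consistent with the wrapper being described as opt-in and not bit-equivalent.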
Unified ML op facade for the baracuda CUDA ecosystem. Exposes every primitive an ML framework would expect (union of PyTorch torch.* + nn.functional and JAX lax.* / numpy ops) through a single Plan-based Rust surface, internally dispatching to baracuda-cutlass, the baracuda-* NVIDIA-library wrappers, or bespoke baracuda-kernels-sys kernels.
Safe Rust wrapper for compiled CUTLASS kernels: plan-based GEMM and grouped GEMM with caller-supplied workspace, typed device-buffer arguments, and capture-safe launch.
Compiled bespoke .cu kernel template instantiations for the baracuda ML kernel facade plus C-ABI FFI facades for the library-backed plans (cuDNN conv/pool, cuSOLVER linalg, cuFFT/cuRAND, CUTLASS GEMM re-export). Hosts curated CUDA kernel sources (int8/FP8/int4/bin GEMM RRR, elementwise, reduce, norm, attention, …), builds them via baracuda-forge, exposes extern "C" entry points for the safe baracuda-kernels crate. CUTLASS template kernels live in the sibling baracuda-cutlass-kernels-sys crate and are re-exported here under the unified baracuda_kernels_gemm_* namespace.
Compiled CUTLASS template instantiations for the baracuda ecosystem. Hosts curated .cu kernel sources, builds them via baracuda-forge, exposes extern "C" entry points for the safe baracuda-cutlass crate.