Single-file WebGPU GEMM (general matrix multiply) library with device-tuned F32 and F16 kernels. Use it for fast matrix multiplication in browser WebGPU apps.