Batched cosine, dot product, and L2 distance for f32 embeddings, with a heap-based top-k selector. No BLAS, no allocator surprises.
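
To make the description concrete, here is a minimal sketch of what such kernels and a heap-based selector can look like using only the Rust standard library. It is an illustration, not this project's API: the names `dot`, `l2_sq`, `cosine`, `Hit`, and `top_k` are hypothetical, and the layout (one flat row-major `&[f32]` buffer of `dim`-sized rows) and conventions (higher score is better, cosine of a zero vector is `0.0`) are assumptions made for the example.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Dot product of two equal-length slices.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared L2 distance (no square root, so ranking is unchanged).
pub fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Cosine similarity. Assumption: a zero-norm input yields 0.0.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let denom = (dot(a, a) * dot(b, b)).sqrt();
    if denom == 0.0 { 0.0 } else { dot(a, b) / denom }
}

/// One search result: row index and its score (higher is better).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Hit {
    pub index: usize,
    pub score: f32,
}

/// Wrapper with reversed ordering, so the max-heap's top is the worst kept hit.
struct MinScore(Hit);

impl Ord for MinScore {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on score; ties prefer the lower row index.
        other.0.score
            .total_cmp(&self.0.score)
            .then_with(|| self.0.index.cmp(&other.0.index))
    }
}
impl PartialOrd for MinScore {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl PartialEq for MinScore {
    fn eq(&self, other: &Self) -> bool { self.cmp(other) == Ordering::Equal }
}
impl Eq for MinScore {}

/// Scores every `dim`-sized row of `data` against `query` and returns the
/// `k` best hits, best first. The heap never holds more than `k` entries.
pub fn top_k<F>(query: &[f32], data: &[f32], dim: usize, k: usize, metric: F) -> Vec<Hit>
where
    F: Fn(&[f32], &[f32]) -> f32,
{
    assert!(dim > 0 && data.len() % dim == 0);
    if k == 0 {
        return Vec::new();
    }
    // Heap buffer is sized once up front and never grows.
    let mut heap = BinaryHeap::with_capacity(k.min(data.len() / dim));
    for (index, row) in data.chunks_exact(dim).enumerate() {
        let cand = MinScore(Hit { index, score: metric(query, row) });
        if heap.len() < k {
            heap.push(cand);
        } else if let Some(mut worst) = heap.peek_mut() {
            // `cand < worst` under the reversed order means cand scores higher.
            if cand < *worst {
                *worst = cand;
            }
        }
    }
    // Ascending under the reversed order is best-first.
    heap.into_sorted_vec().into_iter().map(|m| m.0).collect()
}
```

Because the selector in this sketch ranks by "higher is better", a distance such as L2 would be passed negated, for example `top_k(&q, &data, dim, k, |a, b| -l2_sq(a, b))`. Selection costs O(n log k) comparisons for n rows, and the heap itself is allocated once; the final `collect` may allocate the output vector separately.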