[r/LocalLLaMA]
I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
May 31, 2026
NVIDIA's Parakeet FastConformer models (TDT, CTC, RNNT, hybrid variants) now run in C++ via ggml with no Python or PyTorch dependency, matching NeMo output byte-for-byte on f32/f16 paths. GPU inference hits roughly 600x realtime on a 23-second clip, up to 5x faster than NeMo's PyTorch runtime, with GGUF quantization down to q4_k cutting memory roughly in half. The implementation supports CUDA, HIP, Vulkan, and Metal, plus streaming with word-level timestamps and a flat C API.
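To put the throughput figure in concrete terms, here is a rough back-of-the-envelope sketch. It assumes "Nx realtime" means audio duration divided by wall-clock processing time. The helper name `processing_seconds` is made up for illustration and is not part of the project. The NeMo line also assumes the "up to 5x" speedup applies to this same clip, which the post does not state explicitly.

```python
def processing_seconds(clip_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to transcribe a clip at a given realtime factor.

    Assumes realtime_factor = audio duration / processing time.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return clip_seconds / realtime_factor


clip = 23.0       # clip length in seconds, from the post
gpu_rtf = 600.0   # "roughly 600x realtime", from the post
speedup = 5.0     # "up to 5x faster than NeMo" (assumed to apply to this clip)

gpu_ms = processing_seconds(clip, gpu_rtf) * 1000
nemo_ms = gpu_ms * speedup  # upper bound implied by the 5x figure

print(f"ggml GPU: ~{gpu_ms:.1f} ms")          # ~38.3 ms
print(f"NeMo (at 5x slower): ~{nemo_ms:.1f} ms")  # ~191.7 ms
```

So a 23-second clip would take on the order of 40 ms on the GPU path, if the quoted numbers hold for your hardware.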