[r/LocalLLaMA]

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

May 31, 2026

NVIDIA's Parakeet FastConformer models (TDT, CTC, RNNT, and hybrid variants) now run in C++ via ggml with no Python or PyTorch dependency, matching NeMo output byte-for-byte on the f32/f16 paths. GPU inference reaches roughly 600x realtime on a 23-second clip, up to 5x faster than NeMo's PyTorch runtime. GGUF quantization down to q4_k cuts memory use roughly in half. The implementation supports CUDA, HIP, Vulkan, and Metal, and adds streaming with word-level timestamps and a flat C API.
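To put the throughput figure in wall-clock terms, here is a rough back-of-the-envelope sketch. It only does the arithmetic implied by the numbers quoted above (600x realtime, a 23-second clip, up to 5x over NeMo). The helper name `processing_time` is made up for illustration and is not part of the port or of NeMo, and the NeMo figure is an upper bound derived from the "up to 5x" claim, not a measurement.

```python
def processing_time(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_seconds` of audio
    when running `realtime_factor` times faster than realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


clip_seconds = 23.0   # clip length quoted in the post
ggml_rtf = 600.0      # "roughly 600x realtime" quoted in the post

ggml_time = processing_time(clip_seconds, ggml_rtf)

# Assumption: take "up to 5x faster" at its maximum, so this is an
# upper bound on the implied NeMo time, not a reported benchmark.
nemo_time_upper = ggml_time * 5.0

print(f"ggml: ~{ggml_time * 1000:.0f} ms")                   # ggml: ~38 ms
print(f"NeMo (if 5x slower): ~{nemo_time_upper * 1000:.0f} ms")  # NeMo (if 5x slower): ~192 ms
```

So the 23-second clip would take on the order of 40 ms on the GPU path, against something under 200 ms for NeMo if the full 5x gap applies.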

resources

SOURCE

https://www.reddit.com/r/LocalLLaMA/comments/1tt6oja/i_ported_nvidia_parakeet_speechtotext_to_ggml/
