[r/MachineLearning]

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

April 23, 2026
**Transformer Inference Optimization Beyond FP16 and Pruning**

A practitioner reports hitting a plateau at ~162 MB per transformer model after applying FP16 conversion, ONNX Runtime export, and both structured and unstructured pruning, with none of the post-FP16 steps yielding meaningful size or latency reductions. This is a common finding: unstructured pruning rarely translates into real speedups without hardware sparsity support, and ONNX graph optimizations show diminishing returns after the basic fusion passes. The most impactful next steps in practice are:

- **INT8 static quantization** — typically a 2–4× additional size reduction with <1% accuracy degradation on many encoder models.
- **Knowledge distillation** — if a smaller architecture is acceptable, train a compact student to match the teacher's outputs.
- **TensorRT deployment** — on NVIDIA hardware, kernel fusion and precision mixing can yield 2–5× latency improvements over vanilla ONNX Runtime.
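To illustrate the quantization option, here is a minimal sketch of the per-tensor symmetric INT8 scheme that static quantization is built on. This is illustrative only: real tooling (e.g. ONNX Runtime's static quantizer) also calibrates activation ranges on sample data, and the function names below are made up for the example.

```python
# Per-tensor symmetric INT8 quantization: store int8 values plus one
# float scale, instead of a float per weight (illustrative sketch).

def quantize_int8(weights):
    """Map float weights to int8 codes and a single float scale."""
    scale = max(abs(w) for w in weights) / 127.0  # symmetric range [-127, 127]
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Per-element rounding error is bounded by roughly scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each weight now needs 1 byte instead of 2 (FP16) or 4 (FP32), which is where the size reduction comes from; the accuracy cost is the bounded rounding error shown above.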
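The distillation option can likewise be sketched in a few lines. Below is the usual Hinton-style soft-target loss (temperature-softened teacher probabilities as the student's target); the function names are illustrative, and in a real training loop this term is typically combined with the ordinary hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions,
    scaled by T^2 so its gradient magnitude matches the hard-label loss."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

The temperature spreads probability mass onto non-argmax classes, so the student learns the teacher's inter-class similarity structure rather than just its top-1 labels.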