[HUGGINGFACE] score: 0.48

κ-SwiGLU Adapts MoE Expert Gate Sharpness Per Token via Router Confidence

May 29, 2026

Confidence-Aware SwiGLU (κ-SwiGLU) parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit in MoE transformers, allowing each expert to interpolate between smooth and selective gating per token. Evaluated on FineWeb-Edu across 8–28 layer MoE models, it improves mean CORE benchmark performance.
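The summary above does not give the exact parameterization, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the sharpness κ is a softplus of an affine function of the router logit, with hypothetical learnable scalars `a` and `b`, and that the gate is the sharpness-scaled SiLU `x * sigmoid(κ * x)`. All function names are illustrative, and the expert's down projection is omitted.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z: float) -> float:
    # Stable log(1 + exp(z)); keeps the sharpness strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def kappa(router_logit: float, a: float, b: float) -> float:
    # ASSUMPTION: sharpness is softplus(a * router_logit + b).
    # The source only says it is "a learnable function of the router logit".
    return softplus(a * router_logit + b)

def sharp_silu(x: float, k: float) -> float:
    # SiLU with sharpness k: x * sigmoid(k * x).
    # k = 1 is standard SiLU; large k approaches ReLU; k near 0 approaches x / 2.
    return x * sigmoid(k * x)

def kappa_swiglu(x, w_gate, w_up, router_logit, a, b):
    # x: token activations (length d_model).
    # w_gate, w_up: hypothetical expert weights, each a list of d_ff rows.
    # Returns the gated hidden vector (length d_ff) for one token in one expert.
    k = kappa(router_logit, a, b)
    hidden = []
    for gate_row, up_row in zip(w_gate, w_up):
        g = sum(w * xi for w, xi in zip(gate_row, x))
        u = sum(w * xi for w, xi in zip(up_row, x))
        hidden.append(sharp_silu(g, k) * u)
    return hidden
```

Under these assumptions with `a > 0`, a token routed with a high logit gets a sharper, more selective gate, while a low-confidence token gets a smoother, closer-to-linear one. The added cost per expert is two scalars and one softplus per token.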

paper

HOW THIS AFFECTS YOU

● researcher: Adaptive gate sharpness in MoE MLPs is a low-overhead modification with measured CORE benchmark gains, worth evaluating in MoE pretraining runs.

SOURCE

https://huggingface.co/papers/2606.00761
