[HUGGINGFACE] score: 0.48
κ-SwiGLU Adapts MoE Expert Gate Sharpness Per Token via Router Confidence
May 29, 2026
Confidence-Aware SwiGLU (κ-SwiGLU) makes the sharpness coefficient of the SiLU gate a learnable function of the router logit in MoE transformers, so each expert can interpolate between smooth and selective gating on a per-token basis. Evaluated on FineWeb-Edu with MoE models of 8 to 28 layers, it improves the mean CORE benchmark score.
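The gating idea can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the summary does not say which function maps the router logit to the sharpness coefficient, so the form `beta = softplus(a * router_logit + b)` and the names `kappa_sharpness`, `kappa_swiglu_expert`, `a`, and `b` are assumptions made for illustration. The sketch relies on the standard Swish form `x * sigmoid(beta * x)`, which equals SiLU at `beta = 1` and approaches ReLU as `beta` grows.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    # Stable log(1 + exp(z)); always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def swish(x, beta):
    # SiLU with sharpness beta: x * sigmoid(beta * x).
    # beta = 1 is plain SiLU; large beta approaches ReLU (selective gating).
    return x * sigmoid(beta * x)


def kappa_sharpness(router_logit, a, b):
    # ASSUMPTION: an affine map of the router logit passed through softplus,
    # with learnable scalars a and b. The paper's actual form may differ.
    return softplus(a * router_logit + b)


def matvec(W, x):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def kappa_swiglu_expert(x, router_logit, W_gate, W_up, W_down, a, b):
    # One expert MLP for one token: the gate sharpness depends on how
    # confidently the router selected this expert for this token.
    beta = kappa_sharpness(router_logit, a, b)
    gate = [swish(g, beta) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)
```

In this sketch, setting `a = 0` and `b = log(e - 1)` fixes `beta` at 1 and recovers an ordinary SwiGLU expert, so the modification adds only two scalars per expert under these assumptions. With `a > 0`, tokens routed with higher logits get a sharper, more ReLU-like gate.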
HOW THIS AFFECTS YOU
● Researcher: Adaptive gate sharpness in MoE MLPs is a low-overhead modification with measured CORE benchmark gains; it is worth evaluating in MoE pretraining runs.