● builder: You can swap in Gefen in place of AdamW to potentially cut optimizer memory by 8x, which could allow larger batch sizes or larger models on the same hardware.
● researcher: Worth evaluating whether the memory reduction holds across model scales and whether convergence properties match AdamW on standard benchmarks.
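To make the memory claim concrete, here is a rough back-of-the-envelope sketch of what an 8x cut in optimizer state could mean. It assumes AdamW stores two fp32 moment estimates per parameter and that the 8x figure applies to that state alone. The 1B-parameter model and the helper names are hypothetical, and nothing here reflects how Gefen actually stores its state.

```python
# Rough estimate of optimizer-state memory. All figures are illustrative
# assumptions, not measurements of AdamW or Gefen.

def adamw_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """AdamW keeps two moment estimates (first and second) per parameter."""
    return 2 * num_params * bytes_per_value


def reduced_state_bytes(baseline_bytes: int, reduction_factor: float = 8.0) -> float:
    """Apply a claimed reduction factor (8x is the figure quoted above)."""
    return baseline_bytes / reduction_factor


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical 1B-parameter model
    baseline = adamw_state_bytes(num_params)
    reduced = reduced_state_bytes(baseline)
    print(f"AdamW state (fp32):  {baseline / 1e9:.1f} GB")
    print(f"After an 8x cut:     {reduced / 1e9:.1f} GB")
    print(f"Memory freed:        {(baseline - reduced) / 1e9:.1f} GB")
```

Whether the freed memory translates into a larger batch or a larger model depends on activation and weight memory as well, which this estimate ignores.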