Playing with Vision Embeddings
June 5, 2026
DINOv3 ViT-S compresses each image into a 384-dimensional embedding. This post reconstructs interpretable images from arbitrary points in that space by differentiable optimization, combining augmentation-based gradients, an untrained transformer backbone as an image prior, and total variation regularization. The reconstructions reveal what semantic information the self-supervised model encodes; for example, interpolating between the embeddings of corn kernels and the Triumphal Arch produces visually coherent blends.
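To make the ingredients named above concrete, here is a minimal sketch in plain Python of two of them: a total variation penalty and linear interpolation between two embeddings. It makes several assumptions. The summary does not state the loss, so the cosine-distance matching term, the `tv_weight` value, and all function names are hypothetical. The encoder, the augmentations, and the transformer image prior are left out entirely, so this only shows the shape of the objective, not the post's implementation.

```python
import math


def total_variation(img):
    """Anisotropic total variation of a 2D grayscale image (list of rows).

    Sums absolute differences between horizontally and vertically
    adjacent pixels; smooth images score low, noisy images score high.
    """
    h, w = len(img), len(img[0])
    tv = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                tv += abs(img[y][x + 1] - img[y][x])
            if y + 1 < h:
                tv += abs(img[y + 1][x] - img[y][x])
    return tv


def lerp(a, b, t):
    """Linear interpolation between two embedding vectors, t in [0, 1]."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def objective(embedding, target, img, tv_weight=0.1):
    """Hypothetical reconstruction loss: embedding mismatch plus TV penalty.

    `embedding` stands in for the encoder output on `img`; in a real
    setup it would be computed by the model and differentiated through.
    """
    mismatch = 1.0 - cosine_similarity(embedding, target)
    return mismatch + tv_weight * total_variation(img)


# Usage: blend two toy "embeddings" and score a candidate image against the blend.
corn = [1.0, 0.0, 0.0]
arch = [0.0, 1.0, 0.0]
target = lerp(corn, arch, 0.5)
candidate_img = [[0.0, 1.0], [1.0, 0.0]]
loss = objective(corn, target, candidate_img)
```

In an actual reconstruction, a quantity like `loss` would be minimized by gradient descent over the image (or over the parameters of the image prior), with `target` swept from one embedding to the other to produce the blends.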