
Playing with Vision Embeddings

June 5, 2026

DINOv3 ViT-S compresses each image into a 384-dimensional embedding. This post reconstructs interpretable images from arbitrary points in that space using differentiable optimization with three ingredients: augmentation-based gradients, an untrained transformer backbone as an image prior, and total variation regularization. The reconstructions reveal what semantic information the self-supervised model encodes; for example, interpolating between the embeddings of corn kernels and the Triumphal Arch produces visually coherent blends.
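Two of the pieces named above are small enough to sketch without the model itself: the total variation penalty and the interpolation between two embeddings. The sketch below is illustrative only and is not taken from the original post. It assumes an anisotropic (absolute-difference) form of total variation, plain linear interpolation, a cosine-distance embedding loss, and a hypothetical `tv_weight` of 0.1; the post may use different choices. It uses plain Python lists to stay dependency-free, whereas a real inversion loop would run on an autodiff framework with the DINOv3 encoder producing `embedding`.

```python
import math


def total_variation(image):
    """Anisotropic total variation of a 2-D grayscale image (list of rows).

    Sums absolute differences between horizontally and vertically adjacent
    pixels. Smooth images score low; noisy images score high.
    """
    height, width = len(image), len(image[0])
    tv = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                tv += abs(image[y][x + 1] - image[y][x])
            if y + 1 < height:
                tv += abs(image[y + 1][x] - image[y][x])
    return tv


def lerp(a, b, t):
    """Linear interpolation between two embedding vectors (assumed scheme)."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def inversion_loss(embedding, target, image, tv_weight=0.1):
    """Assumed objective: cosine distance to the target embedding plus a
    weighted total variation penalty on the candidate image.

    `embedding` stands in for the encoder output on the candidate image;
    `tv_weight` is a hypothetical value, not one from the post.
    """
    distance = 1.0 - cosine_similarity(embedding, target)
    return distance + tv_weight * total_variation(image)
```

To blend two concepts under these assumptions, one would compute `lerp(corn_embedding, arch_embedding, t)` for several values of `t` between 0 and 1 (both names are placeholders) and run the optimization once per interpolated target.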

Read original: prestonbjensen.com
