01/24/2026
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
New York University 2026
https://arxiv.org/abs/2601.16208
Using a simpler design to unlock better AI image generation.
Representation Autoencoders (RAEs) run diffusion in a high-dimensional semantic latent space shared with image understanding, which lets diffusion architectures be simplified at scale.
In controlled comparisons against the FLUX VAE, RAE-based diffusion transformers consistently converge faster, avoid catastrophic overfitting during finetuning, and achieve higher-quality text-to-image generation across model sizes from 0.5B to 9.8B parameters.
Notes:
Modern text-to-image systems—those that turn written prompts into pictures—usually work by compressing images into a hidden “latent” space, generating new content there, and then decoding it back into pixels. The quality of that hidden space matters enormously. This paper asks a simple but consequential question: are we using the right kind of latent representation as these models scale up?
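To make that pipeline concrete, here is a minimal, schematic sketch of the encode / denoise / decode loop in PyTorch. The module names (`Encoder`, `Decoder`, `DiffusionTransformer`), shapes, additive conditioning, and plain Euler sampling are all illustrative placeholders of my own, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Pixels -> latent tokens (placeholder for the autoencoder's encoder)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)
    def forward(self, x):                                  # x: (B, 3, 224, 224)
        return self.proj(x).flatten(2).transpose(1, 2)     # (B, 196, dim)

class Decoder(nn.Module):
    """Latent tokens -> pixels (placeholder for the autoencoder's decoder)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.ConvTranspose2d(dim, 3, kernel_size=16, stride=16)
    def forward(self, z, hw=(14, 14)):                     # z: (B, 196, dim)
        return self.proj(z.transpose(1, 2).unflatten(2, hw))

class DiffusionTransformer(nn.Module):
    """Predicts a denoising direction for noisy latents given text features."""
    def __init__(self, dim=768):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
    def forward(self, z_t, t, text_emb):
        return self.block(z_t + text_emb + t)              # crude additive conditioning

@torch.no_grad()
def generate(dit, decoder, text_emb, steps=50, n_tokens=196, dim=768):
    """Schematic Euler sampler: start from noise, step toward clean latents."""
    z = torch.randn(text_emb.size(0), n_tokens, dim)
    for i in range(steps):
        t = torch.full((1, 1, dim), 1.0 - i / steps)
        z = z - dit(z, t, text_emb) / steps
    return decoder(z)

# Training targets come from the encoder; sampling only needs the decoder.
latents = Encoder()(torch.randn(2, 3, 224, 224))           # (2, 196, 768)
images = generate(DiffusionTransformer(), Decoder(),
                  text_emb=torch.randn(2, 1, 768))          # (2, 3, 224, 224)
```

Whether the encoder in this loop is a learned VAE or a frozen pretrained representation model is exactly the design choice the paper interrogates.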
Traditionally, most large text-to-image models rely on *variational autoencoders* (VAEs) to create this latent space. But recent work on diffusion modeling over ImageNet hinted that a different approach, *representation autoencoders* (RAEs), might offer cleaner, more semantic representations. The authors of this study explore whether RAEs can handle the far messier world of large-scale, freeform text-to-image generation.
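As a rough illustration of the difference: a VAE learns both its encoder and decoder end to end, while an RAE (as I read the setup) keeps a pretrained semantic encoder such as DINOv2 frozen and trains only a decoder to invert its features. The sketch below shows that decoder-only training step; the plain MSE loss is a simplification, since decoder training of this kind usually adds perceptual and adversarial terms.

```python
import torch
import torch.nn.functional as F

def rae_decoder_step(frozen_encoder, decoder, optimizer, images):
    """One RAE decoder-training step: the semantic encoder stays frozen and
    only the pixel decoder learns to invert its high-dimensional features."""
    with torch.no_grad():
        latents = frozen_encoder(images)      # frozen, high-dimensional semantic tokens
    recon = decoder(latents)                  # only the decoder receives gradients
    loss = F.mse_loss(recon, images)          # simplified reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```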
To test this, they scale RAEs well beyond curated datasets like ImageNet, training them instead on a mix of web images, synthetic data, and images containing text. They find that simply making models bigger improves overall image quality—but that *what data you train on still matters*, especially for tricky domains like rendering readable text in images.
The team then takes a hard look at the design tricks that previously made RAEs work on smaller datasets. Surprisingly, many of these complexities turn out to be unnecessary at scale. As models grow, the system actually becomes simpler: only the way noise is scheduled during diffusion remains critical, while other architectural embellishments add little value.
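The noise-scheduling point is commonly operationalized as a dimension-dependent timestep shift: because RAE latents have far more channels than a typical VAE latent, training timesteps get pushed toward higher noise levels. The sketch below uses the SD3-style shift with s = sqrt(d / d_base) purely as an illustration of the idea; the exact rule this paper uses may differ, and `base_numel` is my own stand-in for a typical VAE latent size.

```python
import math
import torch

def shifted_timesteps(t, latent_numel, base_numel=32 * 32 * 4):
    """Map uniform t in [0, 1] onto a shifted schedule so that larger latent
    spaces spend more of training at high noise levels.

    Illustrative SD3-style shift: t' = s * t / (1 + (s - 1) * t), s = sqrt(d / d_base).
    """
    s = math.sqrt(latent_numel / base_numel)
    return s * t / (1.0 + (s - 1.0) * t)

t = torch.rand(8)                                          # uniformly sampled timesteps
t_rae = shifted_timesteps(t, latent_numel=196 * 768)       # e.g. 196 tokens x 768 channels
```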
With this streamlined setup, the researchers run a head-to-head comparison between RAE-based and FLUX-VAE-based diffusion transformers, scaling models from hundreds of millions to nearly ten billion parameters. The results are striking: across all sizes, RAE-based models learn faster, produce better images during pretraining, and, crucially, remain stable during long finetuning runs. In contrast, VAE-based models begin to overfit and collapse after extended training, even on high-quality datasets.
Beyond better images, RAEs offer a deeper advantage. Because both image understanding and image generation happen in the *same representation space*, the model can directly reason about what it generates—opening the door to systems that don’t just create images, but can also think about them in a unified way.
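A tiny sketch of what that shared space enables, using toy stand-in modules of my own: the very same latent tokens can be rendered into pixels by a decoder head or read by an understanding head, with no need to re-encode the generated image.

```python
import torch
import torch.nn as nn

dim, n_tokens = 768, 196
latents = torch.randn(1, n_tokens, dim)        # stand-in for sampled RAE latents

pixel_decoder = nn.Linear(dim, 16 * 16 * 3)    # toy stand-in for the RAE pixel decoder
probe = nn.Linear(dim, 1000)                   # toy stand-in for an understanding head

patches = pixel_decoder(latents)               # render the latents as image patches
logits = probe(latents.mean(dim=1))            # reason over the very same latents
```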
Taken together, the results suggest a change of foundation: representation autoencoders are not just a viable alternative to VAEs; they may be a simpler, more robust, and more scalable backbone for the next generation of text-to-image models.
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decode...