We find that, just as a large transformer model trained on language can generate coherent text, the exact same model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
Unsupervised and self-supervised learning [1], or learning without human-labeled data, is a longstanding challenge of machine learning. Recently, it has seen incredible success in language, as transformer [2] models like BERT [3], GPT-2 [4], RoBERTa [5], T5 [6], and other variants [7, 8, 9, 10] have achieved top performance on a wide array of language tasks. However, the same broad class of models has not been successful in producing strong features for image classification [11]. Our work aims to understand and bridge this gap.