Deep learning can classify images efficiently, and transfer learning can be leveraged to accelerate image classification further. In transfer learning, models are first pre-trained on large datasets such as ImageNet. They learn general visual representations that are later transferred to a new task via fine-tuning, using much less data.
Using this recipe, researchers have achieved state-of-the-art performance on a variety of classification tasks, such as the VTAB benchmark, with models like BiT and ViT.
However, fine-tuning still requires task-specific labeled data for every new dataset. Multimodal contrastive learning is an alternative paradigm that has recently risen to prominence and sidesteps this requirement by learning to match images with free-form text. These models can solve new tasks simply by reformulating them as image-text matching problems, without any extra data (also known as zero-shot learning). Contrastive learning is therefore flexible and easy to adapt to new tasks.
With those limitations in mind, the researchers proposed “LiT: Zero-Shot Transfer with Locked-image Text Tuning.” LiT models efficiently learn to match text to an already pre-trained image encoder. This provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two learning styles.
Contrastive Learning on Image-Text Data
A contrastive learning model learns representations from “positive” and “negative” examples: the representations of “positive” examples should be similar to each other but different from those of “negative” examples.
Multimodal contrastive learning applies this idea to pairs of images and their associated texts. A text encoder computes representations of the texts, while an image encoder computes representations of the images. Each image representation should be similar to the representation of its associated text (the “positive”) but distinct from the representations of the other texts in the batch (the “negatives”). This is usually done with randomly initialized models (“from scratch”), which means the encoders must simultaneously learn how to represent images and text and how to match them.
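As a rough illustration of this objective, the sketch below implements a symmetric image-text contrastive (InfoNCE-style) loss in PyTorch. It is a minimal sketch, not the authors' implementation; the `image_emb` and `text_emb` inputs are assumed to come from whatever image and text encoders are being trained.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE-style) loss.

    image_emb, text_emb: [batch, dim] embeddings from the two encoders.
    Matching pairs (row i of each) are the "positives"; all other
    combinations within the batch act as "negatives".
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image should match the i-th text and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # images -> texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # texts -> images
    return (loss_i2t + loss_t2i) / 2
```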
Training can be done on noisy, unaligned pairs of images and text as they naturally appear on the internet. This removes the need for manual labeling and makes it easy to scale up the data. The model also learns richer visual concepts, because it is not limited to the space defined by a fixed set of classification labels. Instead of merely categorizing an image as “coffee,” the model can learn whether it shows “a small espresso in a white mug” or “a large red latte in a flask.”
Once trained, a model that aligns images and text can be used in many ways. For zero-shot classification, we compare image representations with the text representations of the class names. For example, to classify images as “wombat” or “jaguar,” we compute the text representations of both class names and assign each image the class whose text representation best matches its image representation. This approach scales to thousands of classes and makes it simple to solve classification tasks with minimal extra data. Another use is image search (also known as image-text retrieval): given a piece of text, find the images that best match it.
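A minimal sketch of this zero-shot classification procedure, continuing the PyTorch example above. Here `image_encoder`, `text_encoder`, and `tokenize` are hypothetical placeholders for the trained encoders and their text preprocessing, not names from the original work.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    """Assign an image to whichever class name's text embedding it matches best."""
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)    # [1, dim]
    txt_emb = F.normalize(text_encoder(tokenize(class_names)), dim=-1)  # [num_classes, dim]
    scores = (img_emb @ txt_emb.t()).squeeze(0)  # cosine similarity per class
    return class_names[scores.argmax().item()]

# Example usage (assuming the placeholders above are defined):
# zero_shot_classify(img, ["wombat", "jaguar"], image_encoder, text_encoder, tokenize)
```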
The Best of Both Worlds With Locked-image Tuning
As mentioned, transfer learning achieves state-of-the-art accuracy, but it requires per-task datasets, labels, and training. Contrastive models, on the other hand, are more flexible and scalable, but they lag behind in performance. The current state of the art for ImageNet classification using transfer learning is 90.94%, whereas the best contrastive zero-shot models reach 76.4%.
This is where LiT tuning comes in: a text encoder is trained to produce representations that are well aligned with those of an already pre-trained image encoder. Crucially, the image encoder is locked, meaning it does not get updated during training. Although this may seem counterintuitive, since one would expect further training to improve performance, we have found that locking the image encoder consistently yields better results.
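The following sketch shows what locked-image training might look like in PyTorch, reusing the contrastive loss sketched earlier. The encoder, loader, and hyperparameter names are illustrative assumptions rather than the authors' code; the key point is that only the text encoder's parameters are optimized.

```python
import torch

# Lock (freeze) the pre-trained image encoder: no gradients, inference mode.
for p in image_encoder.parameters():
    p.requires_grad_(False)
image_encoder.eval()

# Only the (randomly initialized) text encoder receives gradient updates.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-3)

for images, texts in loader:           # paired image-text batches
    with torch.no_grad():              # image features stay fixed
        img_emb = image_encoder(images)
    txt_emb = text_encoder(texts)
    loss = contrastive_loss(img_emb, txt_emb)  # loss from the earlier sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```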
This stage can be viewed as an alternative to the traditional fine-tuning stage, in which the image encoder is separately adapted to every new classification task. Instead, there is a single stage of LiT tuning, after which the model can classify any new set of categories. LiT-tuned models reach 84.5% zero-shot accuracy on ImageNet classification, a significant improvement over previous methods that substantially narrows the gap between contrastive learning and fine-tuning.
An impressive advantage of contrastive models is their enhanced robustness: they retain high accuracy on datasets that typically fool fine-tuned models, such as ObjectNet and ImageNet-C. LiT-tuned models likewise perform well across multiple challenging versions of ImageNet.
LiT tuning has other benefits. While prior contrastive works need large amounts of data and very long training, the LiT approach is much less data-hungry: a LiT model trained on 24M publicly available image-text pairs rivals the zero-shot classification performance of earlier models trained on 400M privately collected image-text pairs. The locked image encoder also leads to faster training with a smaller memory footprint. On larger datasets, the image representations can be pre-computed; not running the image model during training further improves efficiency and unlocks much larger batch sizes, which increases the number of “negatives” the model sees, a key ingredient of high-performance contrastive learning. The method works well with various forms of image pre-training (e.g., self-supervised learning) and with many publicly available image models. We hope these benefits make LiT an excellent testbed for researchers.
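To illustrate the pre-computation trick, here is a rough sketch with the same illustrative names as before, not the authors' code: because the locked image encoder never changes, its embeddings can be computed once and reused on every pass over the text encoder.

```python
import torch

# One-time pass: cache the locked encoder's embeddings for the whole dataset.
cached = []
with torch.no_grad():
    for images, texts in loader:
        cached.append((image_encoder(images).cpu(), texts))

# Training then only runs the text encoder, so larger batches fit in memory.
for img_emb, texts in cached:
    txt_emb = text_encoder(texts)
    loss = contrastive_loss(img_emb.to(txt_emb.device), txt_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```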
Conclusion
The research team presented Locked-image Tuning (LiT), which contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is computationally efficient and substantially improves zero-shot classification performance compared to existing contrastive learning techniques. AI-driven techniques like this, along with models such as Google's PaLM, will help make search simpler and more efficient.
Want to try it yourself?
The research team has prepared a small interactive demo where you can try out some LiT-tuned models. They have also provided a Colab with more refined use cases and larger models, which is a great way to get started.