Ken Chatfield


Efficient Learning of Domain-specific Visual Cues with Self-supervision

Introducing Perceptual MAE, a new method for efficiently learning domain-specific visual cues using self-supervision. This work is part of our AI 2.0 initiative and was presented at CVPR 2023.

As part of our work to build models which generalise better over different types of damage to vehicles and property, we have developed a new method which can automatically learn to understand visual concepts (such as ‘cracks’ or ‘dents’) directly from images.

The method we developed, based on techniques from generative modelling:

  • Achieves state-of-the-art classification performance (ranking #2 on ImageNet globally at time of writing with an accuracy of 88.6%)
  • Is significantly more data- and compute-efficient than alternative methods, including the top-ranking method (with a model that is over 3x smaller)

We further show these properties generalise across domains and tasks, providing a way to accelerate the creation of performant image classification models in real-world settings, such as those we face at Tractable, with a much reduced requirement for annotated data.

The work was recently presented at CVPR 2023, and we are open-sourcing our approach so that others can build on it.

Learning through Generation

To learn from images without relying on any additional information such as labels, we take the following approach: we mask out parts of the image and ask our model to learn how to fill in (or 'generate') the missing patches:

Generating missing patches as a learning task. Perceptual MAE is trained to reconstruct the image of the dog using only the visible patches on the left-hand side, with the right-hand side showing actual output from our trained model for the missing patches. The model learns from the training data what dogs generally look like, enabling it to reconstruct the head and legs (outlined in green on the right-hand side, both occluded on the left-hand side). For a further example of actual model output on real-world data, see the header image of this blog post.
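The masking step can be sketched in a few lines of NumPy. This is a minimal illustration rather than our actual implementation: the 16-pixel patch size and 75% mask ratio are assumptions borrowed from common MAE setups, and `patchify` and `random_mask` are hypothetical helper names.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into rows of flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a patch hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real photo
patches = patchify(image, patch_size=16)   # 14 x 14 = 196 patches (assumed size)
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed ratio
visible = patches[~mask]                   # the only patches the encoder sees
```

The model is then asked to predict the contents of `patches[mask]` given only `visible` and the positions of the patches.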

This is an example of using a generative model (as we are learning by generating parts of the image) for what is known as self-supervision (as learning occurs not by trying to predict a separately provided label, but by predicting properties of the image itself).

It turns out that by learning to generate missing patches from an image, knowledge is picked up by the model which is useful for understanding its contents. Therefore, rather than e.g. learning to identify dogs directly from human annotations, by learning to complete images of dogs we build some understanding of what dogs generally look like.

A similar approach is used by language models such as ChatGPT and GPT-4, where by predicting the next word in a text the model learns to generate its own sentences. In the case of images, rather than words we learn to directly predict missing pixel values in the image. This was shown to be effective by the masked autoencoders (MAE) work on which we build.
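As a rough sketch of what predicting missing pixel values means as a training objective, the reconstruction error can be computed over the hidden patches only. The function name `masked_mse` and the toy shapes are illustrative assumptions, and averaging over masked patches alone is an assumption based on common MAE practice rather than a detail stated in this post.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error averaged over hidden patches only.

    pred, target: (num_patches, patch_dim) arrays of pixel values.
    mask: (num_patches,) boolean array, True = patch was hidden from the model.
    """
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

target = np.zeros((4, 3))
pred = np.zeros((4, 3))
pred[0] = 1.0   # error on a hidden patch: counted in the loss
pred[1] = 5.0   # error on a visible patch: ignored by the loss
mask = np.array([True, False, True, False])
loss = masked_mse(pred, target, mask)
```

Because every pixel contributes equally to this kind of loss, a model trained on it alone is rewarded for getting fine background texture right just as much as for getting the object right.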

Addressing the Grounding Problem

What motivated our approach was the observation that by predicting missing pixel values directly, MAE has a tendency to overly focus on getting individual pixels exactly correct. If we visualise what the trained model is attending to, we can see this in the focus on pixel-level details in the background:

The original masked autoencoder (MAE) overly focuses on individual pixels, leading to a diffuse attention map characterised by high attention on background water pixels in this image

However, for ImageNet and other similar datasets where the evaluation task is to identify the contents of the overall image, placing more emphasis on the consistency of how the different parts of the image fit together to form the whole is desirable. This has also been shown to align with the way humans assess the contents of images, and we are inspired by this insight in our work.

We add more global image-level information to Perceptual MAE by following two steps:

  1. We train a separate neural network to assess how natural the overall image is compared to data seen in training (equivalent to the ‘discriminator’ in the generative adversarial learning literature)
  2. Crucially, we then encourage our generation model to use information learnt by this model to guide generation

Step 2 uses a technique known as perceptual loss by feature matching, illustrated in the figure below. This ties the internal representation used by our generation model to those of the hidden layers of the ‘discriminator’ model, which is trained to distinguish real images from generated ones:

We use perceptual loss by feature matching: a method which implicitly encourages the internal representation of the data used by the generator network to be similar for images where the contents are the same (e.g. both being of a cat), even if individual pixels may differ
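One common formulation of feature matching compares the hidden activations a discriminator produces for a real image with those it produces for the generated one. The sketch below uses a toy two-layer network with fixed random weights as a stand-in discriminator; the layer sizes, the function names, and the choice of a mean squared difference are all assumptions for illustration and may differ from the formulation in the paper.

```python
import numpy as np

def discriminator_features(x, weights):
    """Return the hidden activations of a toy MLP 'discriminator' for a batch x."""
    feats = []
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # linear layer followed by ReLU
        feats.append(h)
    return feats

def feature_matching_loss(real, generated, weights):
    """Average squared difference between discriminator features, over layers."""
    real_feats = discriminator_features(real, weights)
    gen_feats = discriminator_features(generated, weights)
    per_layer = [np.mean((r - g) ** 2) for r, g in zip(real_feats, gen_feats)]
    return float(np.mean(per_layer))

rng = np.random.default_rng(0)
# Fixed random weights stand in for a trained discriminator (assumption).
weights = [rng.normal(size=(48, 32)), rng.normal(size=(32, 16))]
real = rng.random((8, 48))                           # batch of flattened "images"
noisy = real + 0.1 * rng.normal(size=real.shape)     # imperfect reconstructions

loss_same = feature_matching_loss(real, real, weights)    # identical inputs
loss_noisy = feature_matching_loss(real, noisy, weights)  # differing inputs
```

The loss is zero when the reconstruction produces the same features as the real image and grows as the features diverge, so the generator is penalised in feature space rather than pixel by pixel.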

The result is that the Perceptual MAE generation model focuses much more on object outlines and layout than the original MAE:

Comparing the attention maps of MAE vs our proposed method Perceptual MAE: our method focuses much more on overall image layout and the outlines of objects

By using knowledge of the task at hand (ImageNet classification) we have ‘shaped’ the focus of our model towards relevant details in the image, in this case overall image semantics. This contributes greatly to the efficiency of our approach.

Performance and Results

Our method improves performance on ImageNet, setting a new state-of-the-art of 88.1% without using any additional training data.

If we loosen this restriction and use a pre-trained model for feature matching, we can match the recently released DINOv2 method when using the same input image size, attaining 88.6% accuracy. We achieve this with a much smaller model (only 307M parameters), which speaks to the efficiency of our method.

A summary of our results compared to recent alternative methods, both with and without additional training data, along with a comparison of the parameter count of each model is shown below:

Performance of Perceptual MAE over ImageNet compared to other recent methods, with the number of trainable parameters for each model shown in the grey bars

We found that these results also generalised across different visual tasks, beating previous methods by a similar margin on object detection and semantic segmentation (for further details see the paper).

We also evaluated performance when applied to different domains such as the tasks we have at Tractable to see if the above results translated to real-world settings. We found that Perceptual MAE trained on domain-specific images could provide a significant boost to performance vs a supervised (purely trained with labels) baseline, particularly when trained with a limited budget of labelled images.

This is shown below when fine-tuning for the Tractable task of vehicle damage assessment across a dataset comprising 500K annotated images:

Accuracy over a Tractable classification task (vehicle damage assessment) when either training with conventional supervised learning or pre-training with Perceptual MAE over a training set of limited size. Compared to conventional supervised learning, we obtain improved accuracy when the number of annotated training images is small.

Guided learning = more efficient learning 

Recently, there has been a trend in research towards ever larger models. Perceptual MAE demonstrates that with some carefully selected assumptions, it is possible to achieve better or on-par performance with state-of-the-art methods without requiring ever more data and compute.

This is also visible in training, with our method taking around 10x fewer GPU resources to train than DINOv2. Perceptual MAE also makes learning image classification and other downstream tasks much more data-efficient, requiring fewer labelled examples.

What’s Next?

This work furthers the path of training performant specialist domain-specific models directly from targeted collections of images in an efficient way, providing an alternative to relying solely on scale as a method for boosting the performance of practical ML systems.

It is also a step towards addressing the robustness and bias issues typically associated with the long-tail when using supervised methods. As part of the Tractable AI 2.0 initiative, we are working to ensure that the computer vision models we train rely directly on expert-defined cues such as ‘cracks’ and ‘scratches’ rather than other superfluous correlations. This work forms one part of this broader initiative.

-> Read the paper

-> Get the code
