Self-Supervised Learning (SSL) – A Gentle Introduction

March 2022

1. Overview of Self-Supervised Learning (SSL)

Self-supervised learning, also known as self-supervision, is an emerging solution to a common ML problem: the need for large amounts of human-annotated data. In my opinion, it’s one of the next big breakthroughs in large-scale machine learning, and I see it dominating the production-grade models that Google, Meta, OpenAI, and Microsoft (“the AI superpowers”) are quietly releasing.

In SSL, data labeling is automated and human interaction is eliminated as a bottleneck, so we can scale from datasets of thousands or tens of thousands of examples to billions and really use the power of large-scale ML training clusters (which are basically supercomputers). The model trains itself by using one part of the data to predict another part, so the labels come from the data itself. In effect, this converts an unsupervised learning problem into a supervised one. My teams working on SOTA ML problems see it as one of the general ways to get beyond the supervised approaches that deep neural networks rely on today, and see huge performance increases for tasks like content recommendation, medical diagnosis, Level 5 autonomous vehicles (AVs), virtual assistant interactions, etc.

Specifically, SSL is:

A form of unsupervised learning where the data provides the supervision (for example, hashtags from Instagram posts, GPS locations, or other structure in the data or metadata)
In general, part of the data is withheld and the network is tasked with predicting it (labels are obtained from the data using a semi-automatic process)
The task defines a proxy loss, and the network is forced to learn what we really care about, e.g. a semantic representation, in order to solve it
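
As a rough illustration of "withhold some part of the data and predict it", here is a minimal sketch of how a training example could be built from raw text with no human labeling. The function name make_masked_example and the "[MASK]" placeholder are assumptions made for this example, not part of any particular library, and a real system would feed these pairs to a neural network rather than just constructing them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for the withheld token


def make_masked_example(tokens, rng):
    """Hide one token and return (masked input, position, hidden token).

    The hidden token acts as the label, so the supervision comes
    from the data itself rather than from a human annotator.
    """
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    label = masked[position]
    masked[position] = MASK
    return masked, position, label


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sentence = "the cat sat on the mat".split()
masked, position, label = make_masked_example(sentence, rng)
print(masked, "->", label)
```

The loss on this made-up prediction task is the proxy loss described above: to guess the hidden word well, the network has to learn something about meaning and context.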

Self-supervised learning enables AI systems to learn from orders of magnitude more data, which is important for recognizing and understanding patterns in more subtle, less common representations of the world. Self-supervised learning has long had great success in advancing the field of natural language processing (NLP), including the Collobert-Weston 2008 model, Word2Vec, GloVe, fastText, and, more recently, BERT, RoBERTa, XLM-R, and others. Systems pretrained this way yield considerably higher performance than those trained solely in a supervised manner. Below is an example of a self-supervised learning output.

Image Prediction/Generation is One Example of an SSL Task

2. Self-supervised Learning vs Supervised Learning

Supervised learning is still the most common type of machine learning. In it, data is tagged by an expert, e.g. as a “ball” or “fish”. Unsupervised methods, by contrast, exhibit self-organization that captures patterns as probability densities or a combination of neural feature preferences. The common characteristic of supervised and self-supervised learning is that both build models from training datasets with labels. However, self-supervised learning doesn’t require labels to be added manually, since it generates them itself. Also, 10% of concepts account for 93% of supervised data, so the long tail is left out. This means that in real-world applications, supervised learning falls short and won’t scale (imagine doing disease diagnosis if you can only recognize the 10% most common diseases, but not the 90% that are diagnosed less often).

The comparison between self-supervised and semi-supervised learning is similar. Semi-supervised learning combines manually labeled training data (used in a supervised way) with unsupervised approaches for unlabeled data, producing a model that leverages existing labels but can make predictions beyond the labeled data. Self-supervised learning relies completely on data that lacks manually generated labels.

3. Self-supervised Learning vs Unsupervised Learning

Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a compact internal representation of its world and then generate imaginative content from it.

Self-supervised learning is similar to unsupervised learning because both techniques work with datasets that don’t have manually added labels. Some sources treat self-supervised learning as a subset of unsupervised learning. However, most unsupervised learning concentrates on clustering, grouping, and dimensionality reduction, while self-supervised learning aims to draw conclusions for regression and classification tasks.

4. Hybrid Approaches vs. Self-supervised Learning

There are also hybrid approaches that combine automated data labeling tools with supervised learning. In such methods, computers label the data points that are easier to label, relying on their training data, and leave the complex ones to humans. Or they label all data points automatically but need human approval. In self-supervised learning, automated data labeling is embedded in the training model. The dataset is labeled as part of the learning process, so it neither asks for human approval nor labels only the simple data points.
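
To make the hybrid workflow concrete, here is a minimal sketch of one way the "easy ones to the machine, hard ones to humans" split could work. It assumes each automatic prediction comes with a confidence score and uses a fixed cutoff of 0.9; both the function name route_predictions and the threshold are hypothetical choices for illustration, not a description of any specific labeling tool.

```python
def route_predictions(predictions, threshold=0.9):
    """Split automatic predictions into auto-labeled items and human review.

    predictions: list of (item_id, predicted_label, confidence) tuples.
    Items at or above the (assumed) confidence threshold keep the
    machine-generated label; the rest are queued for a human annotator.
    """
    auto_labeled, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))
        else:
            needs_review.append(item_id)
    return auto_labeled, needs_review


predictions = [("a", "cat", 0.97), ("b", "dog", 0.55), ("c", "cat", 0.90)]
auto_labeled, needs_review = route_predictions(predictions)
print(auto_labeled)   # items the machine labels on its own
print(needs_review)   # items left for a human
```

Self-supervised learning has no such routing step: every example gets its label from the data as part of training.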

5. Early Use Cases/Applications of SSL

SEER (SElf-supERvised) is a new billion-parameter self-supervised computer vision model that can learn from any random group of images on the internet, without the careful curation and labeling that goes into most computer vision training today. It’s basically a foundation model for many vision-related AI use cases and has SSL at its core.

Other early and important SSL use cases are:

Colorization: The technology can be used for coloring grayscale images.
Context Filling: The technology can fill a space in an image or predict a gap in a voice recording or a text.
Video Motion Prediction: Self-supervised learning can provide a distribution of all possible video frames after a specific frame.
Healthcare: This technology can help robotic surgeries perform better by estimating dense depth in the human body. It can also provide better medical visuals with improved computer vision technologies such as colorization and context filling.
Autonomous vehicle (AV) driving: Self-supervised learning can be used in estimating the roughness of the terrain. It can also be useful for depth completion to identify the distance to the other cars, people, or other objects while driving.
Virtual assistants: Self-supervised systems can also be applied to virtual assistants. Transformers, a neural network architecture commonly pretrained with self-supervised learning, are successful at processing words and mathematical symbols. However, they are still far from understanding human language.
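
Colorization from the list above is a good example of how the label comes for free. A minimal sketch, assuming an image is a nested list of (r, g, b) tuples: the grayscale version becomes the model input and the original color image is the target. The helper names are made up for this example, and the weights are the common ITU-R BT.601 luma coefficients; other grayscale conversions would work just as well.

```python
def to_grayscale(rgb_image):
    """Convert a nested list of (r, g, b) pixels to grayscale intensities.

    Uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114).
    """
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def make_colorization_pair(rgb_image):
    """Return (input, target): grayscale input, original colors as the label."""
    return to_grayscale(rgb_image), rgb_image


# A tiny 2x2 "image": red, green / blue, white
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray_input, color_target = make_colorization_pair(image)
print(gray_input)
```

No one had to annotate anything: throwing away the color and asking the model to put it back is the whole labeling process.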

Word2Vec and BERT are 2 Important SSL Models

6. Problems and Limitations of SSL

Building models can be more computationally intense. Models trained on labeled data can be built much faster than models trained on unlabeled data. Plus, self-supervised learning autonomously generates labels for the given dataset, which is an additional task. Therefore, compared to other learning methods, self-supervised learning can demand more computing resources.

Inaccurate labels might lead to inaccurate results. You generally achieve the best results when you already have labels for your dataset. Self-supervised learning is a solution for when you don’t have any and would otherwise need to generate them manually. However, it can come up with inaccurate labels during processing, and those inaccuracies can lead to inaccurate results for your task. Thus, labeling accuracy is an additional factor to consider when improving self-supervised models.

7. Why Self-Supervised Learning?

Some of the reasons why SSL matters so much:

The high expense in supervised learning of producing a new dataset for each new task
In some areas of ML, such as medical data, annotations are hard to obtain
SSL taps the vast, largely unused supply of unlabelled images and videos. For example, one billion images are uploaded to Facebook per day, and 300 hours of video are uploaded to YouTube every minute
SSL is likely how infants learn
It leverages multiple modalities to learn labels
SSL can learn things that humans have a hard time labelling, like the intrinsic style quality of videos

Self-supervised learning empowers us to exploit a variety of labels that come with the data for free. The motivation is quite straightforward. Producing a dataset with clean labels is expensive but unlabeled data is being generated all the time. To make use of this much larger amount of unlabeled data, one way is to set the learning objectives properly so as to get supervision from the data itself.

The self-supervised task, also known as a pretext task, guides us to a supervised loss function. However, we usually don’t care about the final performance of this invented task. Rather we are interested in the learned intermediate representation with the expectation that this representation can carry good semantic or structural meanings and can be beneficial to a variety of practical downstream tasks. For example, we might rotate images at random and train a model to predict how each input image is rotated. The rotation prediction task is made-up, so the actual accuracy is unimportant, like how we treat auxiliary tasks. But we expect the model to learn high-quality latent variables for real-world tasks, such as constructing an object recognition classifier with very few labeled samples.
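
The rotation example above can be sketched in a few lines. This is only the data side of the pretext task, assuming an image is a 2D list of pixel values; rotate90 and make_rotation_example are names invented for this sketch, and the model that would be trained on these pairs is left out.

```python
import random


def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_example(image, rng):
    """Rotate the image by a random number of quarter turns.

    Returns (rotated image, k), where k in {0, 1, 2, 3} is the label
    the model would be asked to predict. The label is generated by us,
    not by a human annotator.
    """
    k = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[1, 2], [3, 4]]
rotated, k = make_rotation_example(image, rng)
print(rotated, "rotated by", 90 * k, "degrees")
```

A classifier trained to predict k from the rotated image has to pick up on object shape and orientation, and that learned representation is what we would reuse for the downstream task.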

There’s a limit to how far the field of AI can go with supervised learning alone. Supervised learning is a bottleneck for building more intelligent generalist models that can do multiple tasks and acquire new skills without massive amounts of labeled data. Practically speaking, it’s impossible to label everything in the world. There are also some tasks for which there’s simply not enough labeled data, such as training translation systems for low-resource languages. If AI systems can glean a deeper, more nuanced understanding of reality beyond what’s specified in the training data set, they’ll be more useful and ultimately bring AI closer to human-level intelligence. As babies, we learn how the world works largely by observation. We form generalized predictive models about objects in the world by learning concepts such as object permanence and gravity. Later in life, we observe the world, act on it, observe again, and build hypotheses to explain how our actions change our environment by trial and error.

A working hypothesis is that generalized knowledge about the world, or common sense, forms the bulk of biological intelligence in both humans and animals. This common-sense ability is taken for granted in humans and animals but has remained an open challenge in AI research since its inception. In a way, common sense is the dark matter of artificial intelligence.

Common sense helps people learn new skills without requiring massive amounts of teaching for every single task. For example, if we show just a few drawings of cows to small children, they’ll eventually be able to recognize any cow they see. By contrast, AI systems trained with supervised learning require many examples of cow images and might still fail to classify cows in unusual situations, such as lying on a beach. How do humans learn to drive a car in about 20 hours of practice with very little supervision, while fully autonomous driving still eludes our best AI systems trained with thousands of hours of data from human drivers? The short answer is that humans rely on their previously acquired background knowledge of how the world works.

Further Resources on SSL