Momentum Contrast for Unsupervised Visual Representation Learning.


This paper introduces a new contrastive learning approach – momentum contrast (MoCo). The key ideas are: 1) implement a queue as the dictionary to store a large number of keys; 2) update the key encoder using the momentum update rule. Compared with previous contrastive learning methods based on memory bank and end-to-end learning, MoCo not only supports a large negative sample size but also maintains a consistent key encoding. Through an ablation study, the authors verified the effectiveness of the proposed queue dictionary and momentum encoder update. When using a linear classifier on the extracted features from MoCo, the classification accuracy is competitive with SOTA methods. Moreover, MoCo does not require customized architectures for the pretext tasks, which makes it easier to transfer features for downstream tasks. The authors further tested the transferability of the learned feature from MoCo; experiment results show that MoCo outperforms the supervised learning approach on seven detection and segmentation tasks. The authors claim that MoCo has almost pushed the performance of unsupervised learning to the same level as supervised learning in terms of learning useful features.


This paper reports a very encouraging progress on unsupervised learning, and it will likely be very influential. The key contributions of this paper enable contrastive learning to learn useful representations for many downstream visual tasks. Experiments are extensive and convincing. The authors’ claims are well supported by experiment results. The PyTorch-style pseudocode is my favorite part, which delivers the key idea of this paper in a precise and concise manner. Such contrastive learning methods could be applied to many other domains, such as NLP and RL. A recent ICML paper has shown that contrastively learned features could boost policy learning significantly [1].


Overall, I feel this is a very high-quality paper, and I didn’t spot any significant weakness of this paper. However, one inherent issue of contrastive learning approaches is that they are very computation and memory intensive due to the requirement of a large number of negative pairs. It would be important to improve its memory and computation efficiency to allow a broader adoption for other applications.


  1. Laskin, Michael, Aravind Srinivas, and Pieter Abbeel. “CURL: Contrastive Unsupervised Representations for Reinforcement Learning.” In the proceedings of the International Conference on Machine Learning (ICML). 2020.


PointRend: Image Segmentation as Rendering

1 minute read

Summary Image segmentation based on convolutional neural networks is often based on regular grids. However, because the label map for interior regions of det...

Decoupled Weight Decay Regularization.

1 minute read

Summary In DL literature, researchers often use weight decay and L2 regularization interchangeably. It’s because they are equivalent when using SGD as the re...

Desining Network Design Space.

2 minute read

Summary This paper introduces a method for designing the design space of deep neural networks. The core idea is to sample network designs from a design space...

Back to Top ↑