Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer

  • Zhihe Lu
  • Sen He
  • Xiatian Zhu
  • Li Zhang
  • Yi Zhe Song
  • Tao Xiang

Research output: Contribution to journal › Conference article › peer-review

181 Citations (Scopus)

Abstract

A few-shot semantic segmentation model is typically composed of a CNN encoder, a CNN decoder and a simple classifier (separating foreground and background pixels). Most existing methods meta-learn all three model components for fast adaptation to a new class. However, given that as few as a single support set image is available, effective adaptation of all three components to the new class is extremely challenging. In this work we propose to simplify the meta-learning task by focusing solely on the simplest component, the classifier, whilst leaving the encoder and decoder to pre-training. We hypothesize that if we pre-train an off-the-shelf segmentation model over a set of diverse training classes with sufficient annotations, the encoder and decoder can capture rich discriminative features applicable to any unseen class, rendering the subsequent meta-learning stage unnecessary. For the classifier meta-learning, we introduce a Classifier Weight Transformer (CWT) designed to dynamically adapt the support-set trained classifier's weights to each query image in an inductive way. Extensive experiments on two standard benchmarks show that despite its simplicity, our method outperforms the state-of-the-art alternatives, often by a large margin. Code is available at https://github.com/zhiheLu/CWT-for-FSS.
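The abstract describes CWT as adapting the support-set trained classifier's weights to each query image via attention. The following is a minimal NumPy sketch of that idea only, not the authors' implementation: the classifier weights act as the attention query and the query-image pixel features as key/value, with a residual update so the support-trained weights remain the starting point. The projection matrices (`Pq`, `Pk`, `Pv`, `Po`), shapes, and scaling are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cwt_adapt(W, F, Pq, Pk, Pv, Po):
    """Adapt classifier weights W (C x d) to one query image's
    pixel features F (N x d) with single-head attention.
    The residual keeps the support-trained W as the base."""
    Q = W @ Pq                                   # (C, d') queries from weights
    K = F @ Pk                                   # (N, d') keys from query pixels
    V = F @ Pv                                   # (N, d') values from query pixels
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (C, N) attention over pixels
    return W + (A @ V) @ Po                      # (C, d) residual weight update

# toy shapes: C=2 classes (fg/bg), d=8 feature dims, N=16 query pixels
rng = np.random.default_rng(0)
C, d, dp, N = 2, 8, 8, 16
W = rng.standard_normal((C, d))
F = rng.standard_normal((N, d))
Pq, Pk, Pv = (rng.standard_normal((d, dp)) * 0.1 for _ in range(3))
Po = rng.standard_normal((dp, d)) * 0.1
W_adapted = cwt_adapt(W, F, Pq, Pk, Pv, Po)
print(W_adapted.shape)  # (2, 8)
```

Because only the query image's own features drive the update, the adaptation is inductive (per query image, with no access to other test images), matching the abstract's description.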

Original language: English
Pages (from-to): 8721-8730
Number of pages: 10
Journal: Proceedings of the IEEE International Conference on Computer Vision
Publication status: Published - 17 Oct 2021
Externally published: Yes
Event: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: 11 Oct 2021 - 17 Oct 2021

