Paper Review: Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains
Let’s talk about furries!
I mean: this is an exciting paper about learning and combining representations of object shape and appearance from different domains (for example, dogs and cars). This makes it possible to build a model that borrows different properties from each domain and generates images that don't exist in any single domain. The main idea is the following:
- use FineGAN as a base model;
- represent object appearance with a differentiable histogram of visual features;
- optimize the generator so that images with different shapes but similar appearances produce similar histograms.
There were earlier works on disentangling representations of shape and appearance, but they were usually intra-domain - for example, combining a sparrow’s appearance with a duck’s shape (domain of birds).
The main challenge of combining shapes and appearances from different domains is that the dataset contains no ground truth for such combinations, so the model will penalize these images during training.
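To make the histogram idea from the list above more concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the function name `soft_histogram`, the use of 1x1 filters, the softmax-based soft assignment, and the `temperature` parameter are all hypothetical choices. Each pixel is softly assigned to one of K filters, and the assignments are averaged over the image. Every step is a smooth operation, so in an autodiff framework gradients could flow back to the filters.

```python
import numpy as np

def soft_histogram(features, filters, temperature=1.0):
    """Soft histogram of filter responses (illustrative sketch).

    features: (H, W, C) image or feature map.
    filters:  (K, C) bank of 1x1 filters (assumed; would be learnable in practice).
    Returns a (K,) vector of non-negative entries that sum to 1.
    """
    # Response of every filter at every pixel: (H, W, K).
    responses = (features @ filters.T) / temperature
    # Numerically stable softmax over the K filters at each pixel.
    responses = responses - responses.max(axis=-1, keepdims=True)
    weights = np.exp(responses)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Average the soft assignments over all spatial positions.
    return weights.mean(axis=(0, 1))

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 3))   # toy "image"
filters = rng.standard_normal((5, 3))       # toy filter bank with K = 5
hist = soft_histogram(features, filters)
print(hist.shape, hist.sum())
```

Because the result is an average over pixels, it does not change when the pixels are rearranged. That is the property that makes such a histogram a plausible description of appearance that ignores shape.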
The base model is FineGAN; its main ideas are:
- 4 latent variables as input: noise vector and one-hot vectors for shapes, appearances, and backgrounds;
- the model generates images in several stages: generate background, draw a silhouette of the shape on the background, generate texture/details inside it;
- the model requires bounding boxes around the objects as an input;
- the model learns constraints: it pairs appearances with shapes, so that, for example, duck shapes are associated with duck appearances, sparrow shapes are associated with sparrow appearances, and so on.
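The latent inputs and the shape-appearance pairing described above can be sketched roughly as follows. This is a toy illustration, not FineGAN's actual code: the function `sample_codes`, the code sizes, and the rule that consecutive appearance indices share one shape are assumptions made for the example, and the background code is sampled independently purely for simplicity.

```python
import numpy as np

def one_hot(index, size):
    """Return a float vector of length `size` with a single 1 at `index`."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec

def sample_codes(rng, n_shapes=4, n_appearances=12, n_backgrounds=12, noise_dim=100):
    """Sample the four latent inputs: noise plus three one-hot codes.

    Assumed pairing rule (hypothetical): appearances are split into equal
    groups, and every appearance in a group is tied to the same shape.
    """
    per_shape = n_appearances // n_shapes
    appearance = int(rng.integers(n_appearances))
    shape = appearance // per_shape           # shape is determined by the appearance group
    background = int(rng.integers(n_backgrounds))
    return {
        "noise": rng.standard_normal(noise_dim).astype(np.float32),
        "shape": one_hot(shape, n_shapes),
        "appearance": one_hot(appearance, n_appearances),
        "background": one_hot(background, n_backgrounds),
    }

codes = sample_codes(np.random.default_rng(0))
print({name: value.shape for name, value in codes.items()})
```

With a pairing like this, the generator never sees a shape combined with an appearance from another group during training, which is exactly why cross-domain combinations are out of distribution.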
Combining factors from multiple domains
As already mentioned, if we generate images with shapes and appearances from different domains, the new images will follow a different distribution than the training data and will be penalized.
The authors suggest the following:
- represent the low-level visual concepts (color/texture) using a set of learnable convolutional filters. This representation approximates the frequency of visual concepts represented by the set of filters;
- use contrastive learning: positive samples are pairs of images that have the same shape, appearance, and background, but different poses; all other images are negative samples;
- train the conditional generator so that images with different shapes but similar appearances and backgrounds also form positive pairs.
They also show that this approach works better than CycleGAN and several other models.
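A contrastive objective over appearance histograms could look roughly like the sketch below. It uses an InfoNCE-style loss with cosine similarity, which is a common choice but an assumption here; the paper's exact formulation may differ. The function name `contrastive_loss` and the `temperature` value are hypothetical.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over histograms (illustrative sketch).

    anchors, positives: (N, K) arrays. Row i of `positives` is the positive
    sample for row i of `anchors`; all other rows act as negatives.
    """
    # Cosine similarity between every anchor and every candidate.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # Numerically stable log-softmax over candidates for each anchor.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct candidate for anchor i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(8), size=5)        # five toy histograms
matched = contrastive_loss(hists, hists.copy())  # each anchor paired with its own histogram
mismatched = contrastive_loss(hists, np.roll(hists, 1, axis=0))  # pairs deliberately misaligned
print(matched, mismatched)
```

The loss is lower when each anchor's histogram is closest to its designated positive, so minimizing it pushes images that should share an appearance toward similar histograms, regardless of their shape.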
The authors also talk about limitations:
- the model assumes that there is a hierarchy between shapes and appearances;
- the more domains there are, the longer training takes (as there are more combinations).