StackGAN Results

StackGAN has received nearly 200 citations since it first appeared on arXiv on December 10, 2016.

StackGAN is the first model to generate 256×256 images with photo-realistic details from text descriptions.


The Generative Adversarial Network (GAN), originally proposed by Ian Goodfellow et al., pits a generator network against a discriminator network. The generator is trained to fool the discriminator by progressively improving the images it generates, while the discriminator is trained to tell generated images from real ones.


The main difficulty in generating high-resolution images with GANs is that the support of the natural image distribution and the support of the implied model distribution may not overlap in high-dimensional pixel space.

In StackGAN, the GAN procedure is split into two stages. The Stage-I generator draws a low-resolution image by sketching the rough shape and basic colors of the object from the given text and painting the background from a random noise vector. Conditioned on the Stage-I result, the Stage-II generator corrects defects and adds compelling details, yielding a more realistic high-resolution image.

In Stage-I, StackGAN does not use the text embedding directly as the condition. Instead, it applies a fully-connected layer to the embedding to produce the parameters of a Gaussian distribution, and samples the conditioning variable from that distribution (Conditioning Augmentation). The reason is that the dimensionality of the embedding space is usually much higher than the number of available texts; if the embedding were used directly as the condition, the latent conditioning manifold would be sparse and discontinuous.
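This sampling step can be sketched with the reparameterization trick. The sketch below is a minimal numpy illustration; the dimensions, the randomly initialized weight matrices `W_mu` and `W_logvar`, and the function name are hypothetical stand-ins for the learned fully-connected layer, not the actual StackGAN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 1024-d text embedding compressed to a 128-d condition.
EMB_DIM, COND_DIM = 1024, 128

# Randomly initialized weights standing in for the learned FC layer.
W_mu = rng.normal(0, 0.02, (EMB_DIM, COND_DIM))
W_logvar = rng.normal(0, 0.02, (EMB_DIM, COND_DIM))

def conditioning_augmentation(text_embedding):
    """Map a text embedding to a Gaussian and sample a condition vector.

    Instead of conditioning on the high-dimensional embedding directly,
    predict a mean and log-variance, then sample with the
    reparameterization trick, which smooths the conditioning manifold.
    """
    mu = text_embedding @ W_mu
    logvar = text_embedding @ W_logvar
    eps = rng.standard_normal(COND_DIM)         # eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * eps      # sample ~ N(mu, sigma^2)

emb = rng.standard_normal(EMB_DIM)
c = conditioning_augmentation(emb)
print(c.shape)  # (128,)
```

Because the sample is a differentiable function of `mu` and `logvar`, gradients can flow back through the conditioning variable during training.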

In the generator, instead of deconvolution (transposed convolution), upsampling is combined with several 3×3 convolutions.
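One such upsampling block can be sketched as nearest-neighbor upsampling followed by a "same"-padded 3×3 convolution. The shapes and the naive loop implementation below are illustrative only, assuming an (H, W, C) layout:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_upsample(x, scale=2):
    """Nearest-neighbor upsampling of an (H, W, C) feature map."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def conv3x3(x, w):
    """Naive 'same'-padded 3x3 convolution; w has shape (3, 3, C_in, C_out)."""
    h_dim, w_dim, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))        # zero-pad H and W by 1
    out = np.zeros((h_dim, w_dim, w.shape[-1]))
    for i in range(h_dim):
        for j in range(w_dim):
            # Contract the 3x3xC_in patch against the kernel.
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3, :], w, axes=3)
    return out

# One upsampling block: 4x4x8 -> (upsample) 8x8x8 -> (3x3 conv) 8x8x16.
x = rng.standard_normal((4, 4, 8))
kernel = rng.normal(0, 0.02, (3, 3, 8, 16))
y = conv3x3(nearest_upsample(x), kernel)
print(y.shape)  # (8, 8, 16)
```

Upsample-then-convolve doubles the spatial resolution per block while avoiding the checkerboard artifacts that transposed convolutions can introduce.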

In the discriminator, several convolutions with stride 2 downsample the image, and the resulting feature map is combined with the spatially replicated text embedding.
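The combination step can be sketched as replicating the compressed embedding across the spatial grid and concatenating it with the image features along the channel axis. The shapes below (a 4×4×512 feature map and a 128-d embedding) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: the image branch has been reduced to 4x4x512
# by stride-2 convolutions; the text embedding is compressed to 128-d.
feat = rng.standard_normal((4, 4, 512))
text_cond = rng.standard_normal(128)

# Spatially replicate the embedding to 4x4x128 and concatenate along
# channels, so a final conv layer can score the (image, text) pair jointly.
replicated = np.broadcast_to(text_cond, (4, 4, 128))
joint = np.concatenate([feat, replicated], axis=-1)
print(joint.shape)  # (4, 4, 640)
```

This lets the discriminator judge not only whether the image looks real, but whether it matches the text it is paired with.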


In Stage-II, the downsampled Stage-I result and the augmented embedding (sampled from the Gaussian) are combined as input; after several residual blocks, the same upsampling technique is used to obtain the final image.
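The residual blocks have the usual identity-skip form. The sketch below uses simple channel mixes (1×1-conv equivalents) in place of the real 3×3 convolutions and batch normalization, so it only illustrates the skip-connection structure, not the actual Stage-II layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w1, w2):
    """Identity-skip residual block over the channel axis.

    Channel mixes stand in for the real 3x3 convolutions; the key point
    is that the block learns a correction added on top of its input.
    """
    h = np.maximum(x @ w1, 0.0)     # channel mix + ReLU
    return x + h @ w2               # skip connection: input + correction

C = 64
x = rng.standard_normal((16, 16, C))       # hypothetical Stage-II features
w1 = rng.normal(0, 0.02, (C, C))
w2 = rng.normal(0, 0.02, (C, C))
y = residual_block(x, w1, w2)
print(y.shape)  # (16, 16, 64)
```

The additive skip path matches Stage-II's role: it keeps the Stage-I content and learns only the defect corrections and extra detail.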

With more labels, the generator can decompose a complicated distribution into several simpler, lower-dimensional distributions.


It obtained the state-of-the-art Inception Score (IS), with 28.47% and 20.30% improvements on the CUB and Oxford-102 datasets, respectively. The Inception Score is a metric for automatically evaluating the quality of image generative models [Salimans et al., 2016]. It was shown to correlate well with human judgments of the realism of images generated from the CIFAR-10 dataset. The IS uses an Inception-v3 network pre-trained on ImageNet and computes a statistic of the network's outputs on generated images:

$IS(G) = \exp\left( \mathbb{E}_{x \sim p_g}\, D_{KL}\big(p(y|x) \,\|\, p(y)\big) \right)$

where $x \sim p_g$ indicates that $x$ is an image sampled from $p_g$, $D_{KL}(p(y|x) \,\|\, p(y))$ is the KL-divergence between the two distributions, $p(y|x)$ is the conditional class distribution, and $p(y)$ is the marginal class distribution.

The authors who proposed the IS aimed to codify two desirable qualities of a generative model into a metric:

  1. The generated images should contain clear objects (i.e., the images are sharp rather than blurry), so $p(y|x)$ should have low entropy. In other words, the Inception network should be highly confident that there is a single object in the image.
  2. The generative algorithm should output a high diversity of images from all the different classes in ImageNet, so $p(y)$ should have high entropy.
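Given the formula and these two qualities, the score can be computed directly from an array of per-image class probabilities. The sketch below uses toy 4-class predictions rather than real Inception-v3 outputs:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute IS from an (N, K) array of per-image class probabilities
    p(y|x), as produced by a pre-trained classifier on generated images."""
    p_y = probs.mean(axis=0)                     # marginal p(y)
    # Per-image KL(p(y|x) || p(y)), then exponentiate the mean.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Sharp AND diverse predictions: 25 confident images per class.
sharp_diverse = np.eye(4).repeat(25, axis=0)     # (100, 4) one-hot rows
# Blurry predictions: uniform p(y|x) for every image.
blurry = np.full((100, 4), 0.25)

print(inception_score(sharp_diverse))  # ~4.0 (upper bound = #classes)
print(inception_score(blurry))         # ~1.0 (the minimum)
```

The toy cases match the two qualities above: low-entropy $p(y|x)$ with high-entropy $p(y)$ maximizes the score, while uniform predictions drive it to 1.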


The code of StackGAN is released on GitHub.


Conditional GAN is one of the earliest variants of GAN:

$\max_D \; \mathbb{E}_{x \sim P_{data}}\big[\log D(x|y)\big] + \mathbb{E}_{x \sim P_G}\big[\log\big(1 - D(x|y)\big)\big]$

where the condition $y$ could be, for example, pictures or class labels.
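The discriminator's side of this objective can be evaluated numerically. In the sketch below, the scores, batch size, and the helper name `d_objective` are hypothetical; the point is only that a discriminator which separates real from generated (image, condition) pairs attains a higher value:

```python
import numpy as np

def d_objective(d_real, d_fake, eps=1e-12):
    """Value the discriminator maximizes:
    E[log D(x|y)] over real pairs + E[log(1 - D(x|y))] over generated pairs."""
    return float(np.mean(np.log(d_real + eps)) +
                 np.mean(np.log(1.0 - d_fake + eps)))

# Hypothetical discriminator outputs in (0, 1) for two batches of 8 pairs.
good = d_objective(np.full(8, 0.9),    # confident "real" on real data
                   np.full(8, 0.1))    # confident "fake" on generated data
undecided = d_objective(np.full(8, 0.5), np.full(8, 0.5))

print(good > undecided)  # True: a better discriminator scores higher
```

The generator is trained against the same value, minimizing the second term so that $D(x|y)$ on generated pairs rises.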


In pix2pix, paired pictures are used as the condition.


In CycleGAN and DiscoGAN, styles are transferred between different domains without using paired data.


[1] Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014).

[2] Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." arXiv preprint (2017).

[3] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." arXiv preprint arXiv:1703.10593 (2017).

[4] Kim, Taeksoo, et al. "Learning to discover cross-domain relations with generative adversarial networks." arXiv preprint (2017).