Paper Summary of: "Conditional Image Generation with PixelCNN Decoders"
Abstract
The model generates new images from an image density model, which can be conditioned on vectors: tags, labels, or even embeddings from other networks -> It can do a ton with this:
We can feed it labels: It could generate new scenes of that label (ex: Tiger)
We can feed it the embedding of an image of a face -> Creates new portraits.
And it can be a decoder for image autoencoders
And it can match the state of the art for image modeling on ImageNet (measured by log-likelihood)
Introduction
Right now the generative networks don't take any inputs or constraints. They just generate stuff!
There are lots of applications of generation that require conditions to be set (aka inputs).
PixelCNN builds off of PixelRNN and seeks to improve it. Both are generative models, and this paper makes them conditional
Unique value = it returns explicit probability densities, so it's easy to apply to compression + probabilistic planning
There are two ways of doing this: 2D LSTMs or CNNs. LSTMs are more accurate but CNNs are faster -> We'll bring the LSTM's gating into the CNN to get the accuracy of the 2D LSTM with the training speed of the CNN
We can literally feed it a one-hot encoding of the class and it can spit out generated images of that class. Or generate new images given an embedding
Gated PixelCNN
This is the combination of PixelCNN and PixelRNN: the PixelCNN architecture with PixelRNN's gated units folded in
This is what it looks like (the masked convolution figure from the paper):
Think of a CNN. It has a kernel which looks at some block, right?
The middle image shows where the CNN is actually allowed to look: 1 = open, you can see it; 0 = closed, you can't see it
Now the CNN takes all the pixels that it does see and produces a conditional distribution for the next pixel. Multiply those conditionals together over every pixel and you get the joint distribution of the whole image: p(image) = p(x_1) · p(x_2 | x_1) · ... · p(x_n | x_1, ..., x_{n-1})
The right part is just showing that the kernel can only see what's above it and what's to its left
Since each pixel is predicted from all the previous pixels through a deep non-linear network, the model can generate high-quality images
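Here's a minimal sketch of that masking idea in PyTorch (the name MaskedConv2d and all the details are my own, not from the paper): the kernel weights get multiplied by a binary mask so each output pixel only depends on the pixels above it and to its left.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed out below/right of the centre pixel.

    Mask type 'A' also hides the centre pixel itself (used for the first
    layer, so a pixel never sees its own value); 'B' lets deeper layers
    use the centre.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.zeros(kH, kW)
        mask[: kH // 2, :] = 1.0          # every row strictly above the centre
        mask[kH // 2, : kW // 2] = 1.0    # pixels to the left in the centre row
        if mask_type == "B":
            mask[kH // 2, kW // 2] = 1.0  # the centre pixel itself
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# First layer uses mask 'A'; deeper layers would use 'B'.
layer = MaskedConv2d("A", in_channels=1, out_channels=32,
                     kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))  # each output pixel only saw above/left
```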
How we bring this up to the level of PixelRNN
PixelRNNs can access the whole receptive field, while a CNN's receptive field only grows linearly with depth. We can make up for that by using more layers
Also, RNNs have multiplicative units that let them model more complex interactions. We can get the same effect by replacing the ReLU activation with a gated activation unit:
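The gated unit from the paper is y = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x): one convolution produces the "signal", another produces a gate between 0 and 1 that multiplies it elementwise. A tiny sketch with my own naming (in practice one masked convolution outputs 2*C channels and you split them):

```python
import torch

def gated_activation(features: torch.Tensor) -> torch.Tensor:
    """y = tanh(signal) * sigmoid(gate), the paper's gated unit.

    `features` is the output of one masked convolution with 2*C channels;
    the two halves play the roles of W_f * x and W_g * x.
    """
    signal, gate = features.chunk(2, dim=1)  # split along the channel axis
    return torch.tanh(signal) * torch.sigmoid(gate)
```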
Blind spots
There are also blind spots in this model: stacking these masked convolutions means each pixel ends up missing part of the context above and to the right of it, even though those pixels come earlier in raster order
They fixed it with 2 CNN stacks that are combined to cover the blind spot: a vertical stack that sees all the rows above, and a horizontal stack that sees the pixels to the left in the current row
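A rough sketch of the two stacks' masks, reusing the MaskedConv2d pattern from above (helper names are mine):

```python
import torch

def vertical_mask(k: int) -> torch.Tensor:
    """k x k mask: sees every pixel in the rows strictly above."""
    m = torch.zeros(k, k)
    m[: k // 2, :] = 1.0
    return m

def horizontal_mask(k: int, include_centre: bool) -> torch.Tensor:
    """1 x k mask: sees the pixels to the left in the current row."""
    m = torch.zeros(1, k)
    m[0, : k // 2] = 1.0
    if include_centre:  # 'B'-style mask, for layers after the first
        m[0, k // 2] = 1.0
    return m

# Each layer runs both stacks and feeds the vertical stack's output into
# the horizontal one (via a 1x1 conv), with a residual connection on the
# horizontal stack. Together they see everything above and to the left.
```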
Conditional PixelCNN
Remember that we're generating specific images, so we can't just let the CNN go off on its own
We just input the latent vector h into everything: the model distribution becomes p(x | h), and h is fed into the activation function of every layer
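Concretely, the paper's conditional version adds a projection of h to both halves of the gate: y = tanh(W_f ∗ x + V_f h) ⊙ σ(W_g ∗ x + V_g h). A sketch, again with my own names (V_f and V_g become one Linear layer):

```python
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    """Gated activation with the conditioning vector h added to both halves.

    `features`: output of a masked convolution with 2*C channels.
    `h`: conditioning vector (one-hot class label, portrait embedding, ...).
    """
    def __init__(self, h_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(h_dim, 2 * channels)  # plays the role of V_f, V_g

    def forward(self, features: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        bias = self.proj(h)[:, :, None, None]       # broadcast over height/width
        signal, gate = (features + bias).chunk(2, dim=1)
        return torch.tanh(signal) * torch.sigmoid(gate)
```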
Also, we can just replace the decoder in autoencoders with the conditional PixelCNN and it works! The PixelCNN does the heavy lifting on pixel-level detail, so the encoder can focus on high-level abstract information
Experimentation
For CIFAR-10, it (Gated PixelCNN) essentially matched the state of the art (PixelRNN) at much lower training cost
ImageNet, same thing (for unconditional modeling): it matched PixelRNN's state-of-the-art likelihoods there too
Then they tested out the conditional one (feeding in a one-hot encoding of the class)
Then they tested out the portrait embeddings
They can also play with linear interpolations between embeddings:
Also trained autoencoders (m = number of dimensions of the bottleneck)
Conclusion
They just summarize the paper
In the future they want to generate new images from only 1 example image
They also want to try variational autoencoders
They could also try conditioning on image captions instead of labels.