In this work, we propose a method for generatively modeling the joint distribution of images and their corresponding semantic segmentation masks using generative adversarial networks. We extend the StyleGAN architecture by iteratively growing the network during training to add new output channels that model the semantic segmentation masks. We train the proposed method on a large dataset of fashion images, and our experimental evaluation shows that the model produces coherent, plausible samples whose semantic segmentation masks closely match the semantics of the image.
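
To make the idea of adding output channels concrete, the following is a minimal NumPy sketch, not the actual implementation. It assumes the generator ends in a 1x1 output convolution, and that growing the network amounts to appending K rows to that layer's weight matrix, where K is the number of semantic classes. The function names `to_output` and `grow_output_layer`, the initialization scale, and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def to_output(features, weight, bias):
    """Apply a 1x1 convolution: (C, H, W) features -> (out_channels, H, W)."""
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

def grow_output_layer(weight, bias, num_classes, rng):
    """Append num_classes new output channels; existing rows are left untouched.

    Assumption: new channels start from small random weights and zero bias.
    """
    new_w = rng.normal(scale=0.01, size=(num_classes, weight.shape[1]))
    new_b = np.zeros(num_classes)
    return np.concatenate([weight, new_w]), np.concatenate([bias, new_b])

rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 5  # hypothetical feature width, resolution, class count

# Stand-in for the last feature map of the synthesis network.
features = rng.normal(size=(C, H, W))

# Image-only output layer (3 RGB channels), as before growing.
rgb_w = rng.normal(size=(3, C))
rgb_b = np.zeros(3)
image_before = to_output(features, rgb_w, rgb_b)

# Grown output layer: 3 image channels plus K mask-logit channels.
joint_w, joint_b = grow_output_layer(rgb_w, rgb_b, K, rng)
joint = to_output(features, joint_w, joint_b)

image, mask_logits = joint[:3], joint[3:]
mask = mask_logits.argmax(axis=0)  # per-pixel class index in [0, K)
```

Under these assumptions, the image channels are unchanged by the growth step, since the original weights are kept and the mask channels are computed from the same features. Sharing features in this way is one plausible route by which generated masks could stay aligned with the image content.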