Stable Diffusion Explained

Large text-to-image models have achieved remarkable success in enabling high-quality synthesis of images from text prompts, and diffusion models in particular have produced state-of-the-art results on text-to-image generation tasks. Stable Diffusion is built on a particular type of diffusion model called a **latent diffusion model**, proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) and created by researchers and engineers from [CompVis](https://github.com/CompVis), [LMU](https://ommer-lab.com/) and [RunwayML](https://runwayml.com/). The model was initially trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database.

Text conditioning is achieved by encoding the text prompt into embedding vectors with a pretrained language model such as CLIP. Although diffusion models can achieve state-of-the-art results for generating images from text, denoising directly in pixel space is slow and memory-hungry at high resolutions, which makes these models expensive both to train and to use for inference.

Latent diffusion reduces the memory and compute cost by applying the diffusion process in a lower-dimensional *latent* space instead of the actual pixel space. The model is trained to generate latent (compressed) representations of the images, which a decoder then maps back to pixels.
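
To make the compression concrete, here is a minimal sketch (not from the original article) that encodes an image into the latent space and back using the pretrained Stable Diffusion VAE from the Hugging Face `diffusers` library; the checkpoint name, scaling factor, and tensor shapes are assumptions based on the publicly released v1 checkpoints:

```python
import torch
from diffusers import AutoencoderKL

# Load the pretrained Stable Diffusion VAE (v1.5 checkpoint; illustrative choice).
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

# Stand-in for a real 512x512 RGB image, scaled to [-1, 1].
image = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # Encode pixels into latents; 0.18215 is the scaling factor
    # used by the SD v1 checkpoints.
    latents = vae.encode(image).latent_dist.sample() * 0.18215

print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- ~48x fewer values than pixels

with torch.no_grad():
    # Decode back to pixel space: shape (1, 3, 512, 512).
    reconstruction = vae.decode(latents / 0.18215).sample
```

Running the diffusion process over a 64x64x4 latent instead of a 512x512x3 image is what makes training and inference tractable.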
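
The text-conditioning step can be sketched in the same spirit with the pretrained CLIP text encoder from `transformers`. Again, this is an illustration rather than the article's own code; CLIP ViT-L/14 is the encoder used by the Stable Diffusion v1 models:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    ["a photograph of an astronaut riding a horse"],
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Per-token embeddings that condition the denoising process.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```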
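
Putting the pieces together, the full text-to-image pipeline can be exercised in a few lines with the `StableDiffusionPipeline` from `diffusers`. The checkpoint name and sampling parameters below are illustrative defaults, not values prescribed by the article:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a released v1.5 checkpoint (any Stable Diffusion checkpoint works).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,  # number of denoising steps
    guidance_scale=7.5,      # classifier-free guidance strength
).images[0]
image.save("astronaut.png")
```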