In this post, I explore the capabilities of the StableDiffusionPipeline for generating photorealistic images from text prompts.


Text-to-Image

Continuing from the previous post, I set up the environment:

cd stable-diffusion
conda activate ldm

Subsequently, I installed the necessary libraries, diffusers and transformers:

pip install --upgrade diffusers[torch] transformers
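
Before loading any models, it's worth confirming that PyTorch can see the GPU, since every pipeline below is moved to "cuda". A quick sanity check (the exact version printed will of course vary):

import torch
import diffusers

print(diffusers.__version__)      # confirm the upgraded install
print(torch.cuda.is_available())  # the pipelines below assume this prints True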

To begin, let’s explore the diffusion pipeline:

from diffusers import DiffusionPipeline
import torch

# Load Stable Diffusion v1.5 in half precision and move it to the GPU
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")

# Generate an image from the text prompt and save it
image = pipeline("An image of a squirrel in Picasso style").images[0]
image.save("squirrel-image.jpg")

[Image: squirrel-image]
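
One thing to note: sampling is stochastic, so every run produces a different squirrel. For reproducible outputs, the pipeline call accepts a seeded torch.Generator. A minimal sketch (the seed value 42 and the output filename are arbitrary choices of mine):

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipeline.to("cuda")

# A seeded generator makes repeated runs produce the same image
generator = torch.Generator("cuda").manual_seed(42)
image = pipeline(
    "An image of a squirrel in Picasso style", generator=generator
).images[0]
image.save("squirrel-image-seeded.jpg")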


Textual Inversion

As described in the Textual Inversion documentation, the StableDiffusionPipeline supports textual inversion, a fascinating technique that lets models like Stable Diffusion learn new concepts from just a few example images.

To employ textual inversion embedding vectors (which we will reuse again in the Image-to-Image section below), let's download the CharTurner v2 embedding:

wget https://huggingface.co/AmornthepKladmee/embeddings/resolve/main/charturnerv2.pt

Now, let’s apply textual inversion:

from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Register the downloaded embedding under the token "charturnerv2",
# which can then be referenced directly in prompts
pipe.load_textual_inversion("./charturnerv2.pt", token="charturnerv2")

prompt = "charturnerv2, multiple views of the same character in the same outfit, a fit character for a RPG game in best quality, intricate details."
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("character.png")

[Image: character]
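
Besides local .pt files, load_textual_inversion can also pull embeddings straight from the Hugging Face Hub. A sketch using the sd-concepts-library/cat-toy concept from the diffusers documentation, which registers the special token <cat-toy> (the prompt and filename here are just illustrative):

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Download the embedding from the Hub; it registers the token <cat-toy>
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a <cat-toy> on a bookshelf, best quality").images[0]
image.save("cat-toy.png")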


Image-to-Image

Next, using the cartoon-insect.png image generated in the earlier post, let's explore the image-to-image pipeline:

from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The initial image that will be transformed
init_image = Image.open("cartoon-insect.png").convert("RGB")

pipe.load_textual_inversion("./charturnerv2.pt", token="charturnerv2")

prompt = "charturnerv2, cartoon insect"
# strength controls how much the initial image is altered (0 = keep, 1 = redraw)
images = pipe(prompt, image=init_image, strength=0.75, guidance_scale=7.5, num_inference_steps=50).images
images[0].save("cartoon-insect-1.png")

[Image: cartoon-insect]

[Image: cartoon-insect-1]
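
The strength parameter is worth experimenting with: values near 0 mostly preserve the initial image, while values near 1 behave almost like pure text-to-image. A quick sweep, reusing pipe, init_image, and prompt from the snippet above (the three values are arbitrary picks of mine):

# Reusing pipe, init_image, and prompt from the snippet above.
# Lower strength keeps more of the original; higher strength redraws more.
for strength in (0.3, 0.6, 0.9):
    image = pipe(
        prompt,
        image=init_image,
        strength=strength,
        guidance_scale=7.5,
        num_inference_steps=50,
    ).images[0]
    image.save(f"cartoon-insect-strength-{strength}.png")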


Animagine XL 2.0

Animagine XL 2.0 is an advanced latent text-to-image diffusion model tailored for creating high-resolution anime images. It is fine-tuned from Stable Diffusion XL 1.0 (SDXL) on a high-quality anime-style image dataset.

Let’s try out the sample code with some tweaks:

import torch
from diffusers import (
    StableDiffusionXLPipeline, 
    EulerAncestralDiscreteScheduler,
    AutoencoderKL
)

# Load VAE component
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", 
    torch_dtype=torch.float16
)

# Configure the pipeline
pipe = StableDiffusionXLPipeline.from_pretrained(
    "Linaqruf/animagine-xl-2.0", 
    vae=vae,
    torch_dtype=torch.float16, 
    use_safetensors=True, 
    variant="fp16"
)
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.to('cuda')

# Define prompts and generate image
prompt = "face focus, cute, masterpiece, best quality, 1girl, red hair, sweater, looking at viewer, upper body, smiley, outdoors, daylight, blouse, earings"
negative_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, hat"

image = pipe(
    prompt, 
    negative_prompt=negative_prompt, 
    width=1024,
    height=1024,
    guidance_scale=12,
    num_inference_steps=50
).images[0]
image.save("./animagine.png")

[Image: animagine]


Stable Diffusion XL

Stable Diffusion XL (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models.

Let's try text-to-image generation by passing a prompt. By default, SDXL generates a 1024x1024 image, which gives the best results.

from diffusers import AutoPipelineForText2Image
import torch

pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Majestic dragon flying, huge fireworks in the form of Happy CNY 2024, detailed, 8k"
image = pipeline_text2image(prompt=prompt).images[0]
image.save("majestic-dragon.png")

[Image: majestic-dragon]
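
SDXL also ships with a companion refiner model, stabilityai/stable-diffusion-xl-refiner-1.0, which can take over the last denoising steps from the base model to add detail. A minimal sketch following the two-stage "ensemble of expert denoisers" pattern from the diffusers documentation (the 0.8 handoff point is the value the docs suggest; the output filename is mine):

from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
import torch

base = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

# The refiner can share the second text encoder and VAE with the base model
refiner = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Majestic dragon flying, huge fireworks in the form of Happy CNY 2024, detailed, 8k"

# Run the first 80% of the denoising steps with the base model...
latents = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images

# ...and hand the latents to the refiner for the final 20%
image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=latents,
).images[0]
image.save("majestic-dragon-refined.png")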

Wishing everyone a joyous and prosperous Chinese New Year 2024! Huat ah! 🎉🐉