Introduction

While running an AI image generation service, the most persistent problem we faced was the "oil painting" artifact: generated images that weren't sharp, with details smeared as if rendered in oil paint. It was especially noticeable in fine details such as skin texture, hair strands, and eye highlights.

This post documents the process of tracking down and resolving the oil painting artifact issue in a Stable Diffusion 1.5 pipeline on an RTX 3080 12GB environment.


Operating Environment

Here's a summary of the current server environment:

| Item | Spec |
| --- | --- |
| GPU | NVIDIA GeForce RTX 3080 (12GB VRAM) |
| GPU Architecture | Ampere (Compute Capability 8.6) |
| PyTorch | 2.4.0+cu118 |
| CUDA | 11.8 |
| xformers | 0.0.27.post2+cu118 |
| diffusers | 0.29.x |
| Model | Stable Diffusion 1.5 (FP16) |
| Operation Mode | 2 simultaneous instances (VRAM ≤5GB each) |

Since we needed to run 2 image generation instances simultaneously on the RTX 3080 12GB, we used enable_model_cpu_offload() to keep VRAM under 5GB. The UNet is loaded to GPU only during the forward pass and returns to CPU afterward.
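Conceptually, what enable_model_cpu_offload() arranges per module can be sketched as a pre/post-forward hook. The class below is a toy stand-in (not the actual accelerate implementation, which uses real device transfers and hook objects), just to show the on-load/off-load lifecycle around a forward call:

```python
class OffloadedModule:
    """Toy sketch of accelerate-style CPU offload: the module 'lives' on CPU
    and is moved to the GPU only for the duration of a forward call."""

    def __init__(self, forward_fn):
        self.device = "cpu"
        self._forward_fn = forward_fn

    def forward(self, x):
        self.device = "cuda"          # onload just before the forward pass
        try:
            return self._forward_fn(x)
        finally:
            self.device = "cpu"       # offload right after, freeing VRAM

unet = OffloadedModule(lambda x: x * 2)
result = unet.forward(21)             # device is "cuda" only inside the call
# afterwards, unet.device is back to "cpu"
```

The real mechanism registers hooks on each submodule of the pipeline, which matters later when two pipelines share the same modules.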

Dashboard monitoring VRAM usage on RTX 3080 12GB. Two instances each using 4-5GB.

Symptom: "Why Does It Keep Looking Like an Oil Painting?"

Every time I checked the generated images, something felt off. When generating with the same prompt and seed in the original Stable Diffusion WebUI (AUTOMATIC1111), I got sharp results. But images from our service had:

  • Skin smeared to a plastic-like smoothness
  • Loss of distinction between individual hair strands
  • Eye highlights spreading blurrily
  • Cloth wrinkles and textures smoothed out like an oil painting

The symptoms were even worse when Hires.fix (high-resolution correction) was enabled. Details were further blurred during the upscaling process from 512×512 to 1024×1024.

Oil painting artifact comparison. Left: sharp normal image with clear details. Right: image with skin and hair smeared like an oil painting.

Cause 1: cuDNN TF32 Enabled During VAE Decode (Critical)

Discovery Process

For performance optimization, TF32 (TensorFloat-32), an Ampere GPU feature, was globally enabled.

# pipelineCache.py — Ampere GPU optimization
torch.backends.cuda.matmul.allow_tf32 = True   # Matrix multiply ~2x speed boost
torch.backends.cudnn.allow_tf32 = True          # cuDNN operation acceleration
torch.backends.cudnn.benchmark = True           # Auto-select optimal algorithm

TF32, introduced with the Ampere architecture, keeps the dynamic range of FP32 (8-bit exponent) while reducing the mantissa to FP16 level (10 bits) for faster computation. Since the UNet already runs inference in FP16, enabling TF32 for it doesn't meaningfully affect quality.

The problem was the VAE decoder.

We had already created a wrapper to upcast only the VAE decoder to FP32 to prevent oil painting artifacts:

# Original code — only disabling matmul TF32
def _upcast_decode(z, *args, **kwargs):
    prev_tf32 = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False        # ← Only disabling matmul
    try:
        return _orig_decode(z.to(torch.float32), *args, **kwargs)
    finally:
        torch.backends.cuda.matmul.allow_tf32 = prev_tf32

However, the core operations in the VAE decoder are Conv2d (convolution), not matmul (matrix multiply). Conv2d is processed through cuDNN, and cudnn.allow_tf32 was still True.

Comparison diagram of TF32 and FP32 floating-point formats. TF32 has a 10-bit mantissa, lower precision compared to FP32's 23-bit.

Core Principle

The VAE decoder converts data from latent space to actual pixels. It passes through multiple Conv2d layers, and with TF32's 10-bit precision at each layer, subtle differences in color gradients are lost.

  • FP32 (23-bit mantissa): relative precision of about 10⁻⁷ near 1.0 — far finer than any visible color difference
  • TF32 (10-bit mantissa): relative precision of about 5×10⁻⁴ near 1.0 — within an order of magnitude of one 8-bit color step (1/255 ≈ 0.004)

When this accumulates through multiple Conv2d layers, subtle skin tone variations disappear, resulting in a flat "painted" oil painting effect.
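The granularity gap is easy to see in plain Python by truncating a value to a given mantissa width. This is a simplified model of mantissa rounding (it ignores hardware details like subnormals), not the GPU's exact behavior:

```python
import math

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest float with `mantissa_bits` explicit mantissa bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

# A color value just above 1.0, differing by 2**-11 (~0.0005)
step = 1.0 + 2.0 ** -11
tf32_view = quantize_mantissa(step, 10)  # TF32 mantissa width -> collapses to 1.0
fp32_view = quantize_mantissa(step, 23)  # FP32 mantissa width -> difference survives
```

At TF32 width the ~0.0005 difference rounds away entirely; at FP32 width it is preserved exactly, which is the whole story of the lost gradients in one number.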

Fix

# Fixed code — disabling matmul TF32 + cuDNN TF32 + benchmark
def _upcast_decode(z, *args, **kwargs):
    prev_matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
    prev_cudnn_tf32 = torch.backends.cudnn.allow_tf32
    prev_cudnn_bench = torch.backends.cudnn.benchmark
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False       # ← Conv2d also uses FP32!
    torch.backends.cudnn.benchmark = False         # ← Choose numerically stable algorithm
    try:
        return _orig_decode(z.to(torch.float32), *args, **kwargs)
    finally:
        torch.backends.cuda.matmul.allow_tf32 = prev_matmul_tf32
        torch.backends.cudnn.allow_tf32 = prev_cudnn_tf32
        torch.backends.cudnn.benchmark = prev_cudnn_bench

TF32 and benchmark are disabled only during VAE decode, then restored to their original values. Since VAE decode is a small portion of total generation time, the speed impact is negligible.

cudnn.benchmark = False was also added: benchmark mode selects the fastest cuDNN algorithm per input size, but the fastest isn't always the most numerically stable.
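The save-set-restore dance above generalizes into a small context manager, which is harder to get wrong as more flags accumulate. A sketch, using a SimpleNamespace as a stand-in for `torch.backends.cudnn` so it runs without torch; with torch installed you would pass `torch.backends.cudnn` (or `torch.backends.cuda.matmul`) directly:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temp_attrs(obj, **overrides):
    """Temporarily override attributes on obj, restoring the originals on
    exit even if the body raises (the same guarantee as try/finally)."""
    saved = {name: getattr(obj, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(obj, name, value)
        yield
    finally:
        for name, value in saved.items():
            setattr(obj, name, value)

# Stand-in for torch.backends.cudnn so this runs without torch installed.
cudnn = SimpleNamespace(allow_tf32=True, benchmark=True)
with temp_attrs(cudnn, allow_tf32=False, benchmark=False):
    flags_inside = (cudnn.allow_tf32, cudnn.benchmark)   # (False, False)
# outside the with-block, both flags are restored to True
```

Wrapping the decode call in `with temp_attrs(torch.backends.cudnn, allow_tf32=False, benchmark=False):` then expresses the same intent as the hand-written try/finally.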


Cause 2: prompt_embeds dtype Mismatch in img2img Pipeline (High)

Discovery Process

Comparing the txt2img (text→image) call section with the Hires.fix img2img (image→image) call section:

# txt2img call — explicit fp16 ✅
result = self.pipeline(
    prompt_embeds=self.promptEmbeds.to(torch.float16),
    negative_prompt_embeds=self.negativePromptEmbeds.to(torch.float16),
    ...
)

# img2img call — dtype not specified ❌
result = self.imgPipeline(
    prompt_embeds=self.promptEmbeds,           # ← unknown dtype!
    negative_prompt_embeds=self.negativePromptEmbeds,
    ...
)

The dtype of prompt_embeds generated by Compel (prompt weighting library) depends on the state of the text_encoder. In an enable_model_cpu_offload() environment where text_encoder exists on CPU as float32, Compel-generated embeds also become float32.

When this float32 tensor enters the fp16 UNet, implicit type conversion occurs, causing subtle precision loss.
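The precision loss from a cast down to fp16 can be seen without torch, using the stdlib's IEEE 754 half-precision format (struct format code `e`) as a stand-in for what happens when a float32 embedding value is fed to an fp16 UNet:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision (fp16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

x = 0.1234567            # a typical-magnitude embedding value in fp32
y = to_fp16(x)           # what survives the cast to fp16
# fp16 keeps only ~3 significant decimal digits, so y != x; each rounding is
# tiny, but it shifts the conditioning the UNet sees on every denoising step.
```

Specifying `.to(torch.float16)` explicitly doesn't avoid this rounding, but it makes the conversion happen once, deterministically, at a known point instead of implicitly inside the pipeline.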

Fix

# Explicitly specifying fp16 for img2img as well
result = self.imgPipeline(
    prompt_embeds=self.promptEmbeds.to(torch.float16),
    negative_prompt_embeds=self.negativePromptEmbeds.to(torch.float16),
    ...
)

Cause 3: Duplicate enable_model_cpu_offload() Hook Conflict (High)

Discovery Process

Looking at the pipeline cache code, both the txt2img pipeline and the img2img pipeline were calling enable_model_cpu_offload():

pipeline.enable_model_cpu_offload()                          # txt2img

img_pipeline = StableDiffusionImg2ImgPipeline(**pipeline.components)  # Sharing the same VAE!
img_pipeline.enable_model_cpu_offload()                      # ← Duplicate hook registration on same VAE

enable_model_cpu_offload() internally registers hooks from the accelerate library on each module. When two pipelines each register hooks on the same VAE object, the registrations conflict: the second set of hooks can clobber or stack on the first, so the VAE's device placement during decode becomes unpredictable, and modules may be moved or offloaded at the wrong moments.

Cause 4: Scheduler Mismatch Between txt2img and img2img Pipelines (High)

Impact

DPM++ 2M Karras uses a Karras sigma schedule, which warps the noise levels so that more sampling steps land in the low-noise region where fine details are resolved, producing detail-rich images even with fewer steps.

PNDM and DDIM distribute steps uniformly and tend to smooth out details, especially during img2img refinement at low denoising strength (0.3~0.4).

Result: the precision details created by DPM++ 2M Karras were being smeared by PNDM img2img. This was the direct cause of the oil painting artifact worsening with Hires.fix.
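For intuition, the Karras schedule itself fits in a few lines of plain Python. This is a sketch of the formula from Karras et al. (2022); the sigma_min/sigma_max defaults are approximate values for SD 1.5's noise range, not taken from the service's code:

```python
def karras_sigmas(n, sigma_min=0.0292, sigma_max=14.6146, rho=7.0):
    """Karras et al. (2022) sigma schedule: a rho-warped ramp from sigma_max
    down to sigma_min. The warp (rho=7) makes absolute sigma gaps large at
    high noise and very fine near sigma_min, where details are refined."""
    inv = 1.0 / rho
    ramp = [i / (n - 1) for i in range(n)]
    return [
        (sigma_max ** inv + t * (sigma_min ** inv - sigma_max ** inv)) ** rho
        for t in ramp
    ]

sigmas = karras_sigmas(10)
# sigmas[0] == sigma_max, sigmas[-1] == sigma_min, strictly decreasing
```

A uniform schedule (as PNDM/DDIM use) would space these noise levels evenly instead, spending far fewer of its steps at the small sigmas where img2img refinement adds detail.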

Fix

from diffusers import DPMSolverMultistepScheduler

def setScheduler(self):
    schedulerType = self.args['scheduler']
    if schedulerType == 'DPM++ 2M Karras':
        scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipeline.scheduler.config,
            use_karras_sigmas=True,
            algorithm_type="dpmsolver++",
            final_sigmas_type="sigma_min",  # pass here: scheduler.config is frozen after creation
        )
    # ... other scheduler branches ...

    # Key: apply the same scheduler to both txt2img and img2img!
    self.pipeline.scheduler = scheduler
    self.imgPipeline.scheduler = scheduler

Cause 5: BILINEAR Resize After RealESRGAN (Medium)

Discovery Process

Comparing LANCZOS and RealESRGAN approaches in Hires.fix:

# LANCZOS approach — high quality resize ✅
upscaled_image = image.resize((hires_width, hires_height), Image.LANCZOS)

# RealESRGAN approach — why BILINEAR? ❌
sr_image = self.upscaleModel.predict(image)
upscaled_image = sr_image.resize((hires_width, hires_height), Image.BILINEAR)

After 4x upscaling with RealESRGAN, Image.BILINEAR (bilinear interpolation) was used for the resize step to match the target resolution (2x). BILINEAR is a linear blend of neighboring pixels that blurs edges and details.

RealESRGAN's hard-won detail restoration was being smeared again by the BILINEAR resize.
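Why does BILINEAR smear? A pure-Python 1-D toy (not PIL's actual kernel) shows the mechanism: linearly interpolating a hard black-to-white edge at fractional sample positions manufactures intermediate gray values, which is exactly the blur you see at every edge in the image:

```python
def linear_resample(signal, new_len):
    """1-D linear interpolation resample -- the 1-D analogue of BILINEAR."""
    scale = (len(signal) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

edge = [0.0] * 4 + [1.0] * 4          # a hard black-to-white edge
resampled = linear_resample(edge, 7)  # fractional positions blend across the edge
```

LANCZOS instead weights a wider neighborhood with a windowed sinc, whose negative lobes preserve (and slightly sharpen) edges when shrinking an image.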

Comparison of BILINEAR and LANCZOS interpolation. BILINEAR has blurry edges and smeared details, while LANCZOS preserves sharp edges and details.
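Fix

The remedy follows directly from the comparison: keep the RealESRGAN step as-is and do the downscale to the Hires.fix target with Image.LANCZOS instead of Image.BILINEAR. A minimal sketch of the changed resize step (the function name here is illustrative, not from the original code):

```python
from PIL import Image

def downscale_to_target(sr_image: Image.Image, hires_width: int, hires_height: int) -> Image.Image:
    """Resize RealESRGAN output down to the Hires.fix target resolution.

    LANCZOS (windowed sinc) preserves edges when shrinking, so the detail
    the super-resolution model restored survives the final resize.
    """
    return sr_image.resize((hires_width, hires_height), Image.LANCZOS)
```

In the original snippet this is a one-word change on the `sr_image.resize(...)` line: BILINEAR → LANCZOS.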