Introduction
While running an AI image generation service, the most persistent problem we faced was the oil painting artifact: generated images that weren't sharp, with details smeared as if painted in oils. It was especially noticeable in fine details such as skin texture, hair strands, and eye highlights.
This post documents the process of tracking down and resolving the oil painting artifact issue in a Stable Diffusion 1.5 pipeline on an RTX 3080 12GB environment.
Operating Environment
Here's a summary of the current server environment:
| Item | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 3080 (12GB VRAM) |
| GPU Architecture | Ampere (Compute Capability 8.6) |
| PyTorch | 2.4.0+cu118 |
| CUDA | 11.8 |
| xformers | 0.0.27.post2+cu118 |
| diffusers | 0.29.x |
| Model | Stable Diffusion 1.5 (FP16) |
| Operation Mode | 2 simultaneous instances (VRAM ≤5GB) |
Since we needed to run 2 image generation instances simultaneously on the RTX 3080 12GB, we used enable_model_cpu_offload() to keep VRAM under 5GB. The UNet is loaded to GPU only during the forward pass and returns to CPU afterward.
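The offload mechanism can be sketched in plain PyTorch. This is an illustration only (OffloadWrapper is a hypothetical stand-in, not how accelerate implements its hooks), but it shows the load-compute-evict cycle every forward pass goes through:

```python
import torch
import torch.nn as nn

class OffloadWrapper(nn.Module):
    """Minimal sketch of model CPU offload: keep the module on CPU and
    move it to the compute device only for the forward pass.
    (Illustration only; diffusers/accelerate do this with module hooks.)"""

    def __init__(self, module: nn.Module, device: str):
        super().__init__()
        self.module = module.to("cpu")
        self.device = device

    def forward(self, x):
        self.module.to(self.device)            # load weights for this pass
        try:
            return self.module(x.to(self.device))
        finally:
            self.module.to("cpu")              # free VRAM immediately after

device = "cuda" if torch.cuda.is_available() else "cpu"
unet_stub = OffloadWrapper(nn.Linear(8, 8), device)  # stand-in for the UNet
out = unet_stub(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8]); the weights are back on CPU afterward
```

The VRAM saving is exactly this eviction in `finally`: only one large module occupies the GPU at a time, at the cost of a host-device transfer per call.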
Symptom: "Why Does It Keep Looking Like an Oil Painting?"
Every time I checked the generated images, something felt off. When generating with the same prompt and seed in the original Stable Diffusion WebUI (AUTOMATIC1111), I got sharp results. But images from our service had:
- Skin smeared to a plastic-like smoothness
- Loss of distinction between individual hair strands
- Eye highlights spreading blurrily
- Cloth wrinkles and textures smoothed out like an oil painting
The symptoms were even worse when Hires.fix (high-resolution correction) was enabled. Details were further blurred during the upscaling process from 512×512 to 1024×1024.
Cause 1: cuDNN TF32 Enabled During VAE Decode (Critical)
Discovery Process
For performance optimization, TF32 (TensorFloat-32), an Ampere GPU feature, was globally enabled.
```python
# pipelineCache.py: Ampere GPU optimization
torch.backends.cuda.matmul.allow_tf32 = True   # ~2x faster matrix multiplies
torch.backends.cudnn.allow_tf32 = True         # faster cuDNN operations
torch.backends.cudnn.benchmark = True          # auto-select fastest algorithm
```
TF32 (available on Ampere and newer GPUs) keeps FP32's dynamic range (8-bit exponent) while reducing the mantissa to FP16-level precision (10 bits) for faster computation. Since UNet inference already runs in FP16, enabling TF32 there doesn't affect quality.
The problem was the VAE decoder.
We had already created a wrapper to upcast only the VAE decoder to FP32 to prevent oil painting artifacts:
```python
# Original code: only disabling matmul TF32
def _upcast_decode(z, *args, **kwargs):
    prev_tf32 = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False  # ← only matmul is disabled
    try:
        return _orig_decode(z.to(torch.float32), *args, **kwargs)
    finally:
        torch.backends.cuda.matmul.allow_tf32 = prev_tf32
```
However, the core operations in the VAE decoder are Conv2d (convolution), not matmul (matrix multiply). Conv2d is processed through cuDNN, and cudnn.allow_tf32 was still True.
Core Principle
The VAE decoder converts data from latent space to actual pixels. It passes through multiple Conv2d layers, and with TF32's 10-bit precision at each layer, subtle differences in color gradients are lost.
- FP32 (23-bit mantissa): resolves relative differences down to ~0.0000001 (2⁻²³)
- TF32 (10-bit mantissa): resolves relative differences only down to ~0.001 (2⁻¹⁰)
When this accumulates through multiple Conv2d layers, subtle skin tone variations disappear, resulting in a flat "painted" oil painting effect.
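This mantissa gap is easy to demonstrate with a bit-level simulation (real TF32 rounds in hardware; this sketch simply truncates the low 13 mantissa bits, which has the same granularity):

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Simulate TF32 by zeroing the low 13 mantissa bits of an FP32 value
    (FP32 keeps 23 explicit mantissa bits, TF32 keeps 10)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

# A 0.0001-level color difference survives in FP32 but vanishes at TF32 precision:
print(truncate_to_tf32(1.0001))  # 1.0 -> the gradient step is gone
print(truncate_to_tf32(1.001))   # ~1.0009766 -> only coarser steps survive
```

Each Conv2d layer in the decoder quantizes its outputs to this coarser grid, which is why the degradation compounds instead of averaging out.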
Fix
```python
# Fixed code: disabling matmul TF32 + cuDNN TF32 + benchmark
def _upcast_decode(z, *args, **kwargs):
    prev_matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
    prev_cudnn_tf32 = torch.backends.cudnn.allow_tf32
    prev_cudnn_bench = torch.backends.cudnn.benchmark
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False  # ← Conv2d also uses FP32!
    torch.backends.cudnn.benchmark = False   # ← choose a numerically stable algorithm
    try:
        return _orig_decode(z.to(torch.float32), *args, **kwargs)
    finally:
        torch.backends.cuda.matmul.allow_tf32 = prev_matmul_tf32
        torch.backends.cudnn.allow_tf32 = prev_cudnn_tf32
        torch.backends.cudnn.benchmark = prev_cudnn_bench
```
TF32 and benchmark are disabled only during VAE decode, then restored to their original values. Since VAE decode is a small portion of total generation time, the speed impact is negligible.
cudnn.benchmark = False was also added: benchmark mode selects the fastest cuDNN algorithm per input size, but the fastest isn't always the most numerically stable.
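The save/disable/restore pattern can also be packaged as a context manager so the restore logic lives in one place. A refactoring sketch (full_precision_conv is a hypothetical name):

```python
import contextlib
import torch

@contextlib.contextmanager
def full_precision_conv():
    """Temporarily disable TF32 and cuDNN autotuning; restore prior flags on exit."""
    prev = (
        torch.backends.cuda.matmul.allow_tf32,
        torch.backends.cudnn.allow_tf32,
        torch.backends.cudnn.benchmark,
    )
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    torch.backends.cudnn.benchmark = False
    try:
        yield
    finally:
        (torch.backends.cuda.matmul.allow_tf32,
         torch.backends.cudnn.allow_tf32,
         torch.backends.cudnn.benchmark) = prev

# Usage sketch:
#     with full_precision_conv():
#         image = vae.decode(z.float())
```

The `finally` guarantees the flags are restored even if the decode raises, which a manual save/restore can easily miss.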
Cause 2: prompt_embeds dtype Mismatch in img2img Pipeline (High)
Discovery Process
Comparing the txt2img (text→image) call section with the Hires.fix img2img (image→image) call section:
```python
# txt2img call: explicit fp16 ✅
result = self.pipeline(
    prompt_embeds=self.promptEmbeds.to(torch.float16),
    negative_prompt_embeds=self.negativePromptEmbeds.to(torch.float16),
    ...
)

# img2img call: dtype not specified ❌
result = self.imgPipeline(
    prompt_embeds=self.promptEmbeds,  # ← unknown dtype!
    negative_prompt_embeds=self.negativePromptEmbeds,
    ...
)
```
The dtype of prompt_embeds generated by Compel (prompt weighting library) depends on the state of the text_encoder. In an enable_model_cpu_offload() environment where text_encoder exists on CPU as float32, Compel-generated embeds also become float32.
When this float32 tensor enters the fp16 UNet, implicit type conversion occurs, causing subtle precision loss.
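A small helper can make the cast explicit at every call site instead of relying on whatever dtype Compel happened to produce. A sketch (match_dtype is a hypothetical helper; the Linear module stands in for the fp16 UNet):

```python
import torch

def match_dtype(embeds: torch.Tensor, module: torch.nn.Module) -> torch.Tensor:
    """Cast prompt embeddings to the dtype of the module that consumes them
    (e.g. an fp16 UNet), so no implicit conversion happens inside the pipeline."""
    target = next(module.parameters()).dtype
    return embeds if embeds.dtype == target else embeds.to(target)

unet_stub = torch.nn.Linear(768, 768).to(torch.float16)  # stand-in for the fp16 UNet
embeds = torch.randn(1, 77, 768)  # fp32, like Compel output from a CPU text_encoder
print(match_dtype(embeds, unet_stub).dtype)  # torch.float16
```

Reading the target dtype off the consuming module means the helper keeps working even if the pipeline is later loaded in bf16 or fp32.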
Fix
```python
# Explicitly specifying fp16 for img2img as well
result = self.imgPipeline(
    prompt_embeds=self.promptEmbeds.to(torch.float16),
    negative_prompt_embeds=self.negativePromptEmbeds.to(torch.float16),
    ...
)
```
Cause 3: Duplicate enable_model_cpu_offload() Hook Conflict (High)
Discovery Process
Looking at the pipeline cache code, both the txt2img pipeline and the img2img pipeline were calling enable_model_cpu_offload():
```python
pipeline.enable_model_cpu_offload()  # txt2img
img_pipeline = StableDiffusionImg2ImgPipeline(**pipeline.components)  # shares the same VAE!
img_pipeline.enable_model_cpu_offload()  # ← duplicate hook registration on the same VAE
```
enable_model_cpu_offload() internally registers hooks from the accelerate library on each module. When two pipelines each register offload hooks on the same VAE object, the hooks can fight over device placement: one pipeline's hook may offload the VAE back to CPU while the other pipeline is mid-generation, leaving modules on unexpected devices.
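The effect of registering two hooks on one shared module can be illustrated with plain PyTorch forward pre-hooks (a simplified stand-in for accelerate's offload hooks):

```python
import torch
import torch.nn as nn

vae_stub = nn.Linear(4, 4)  # stand-in for the shared VAE module
moves = []                  # records every device move a hook would trigger

def make_offload_hook(owner: str):
    def hook(module, inputs):
        moves.append(owner)  # a real offload hook moves the module here
    return hook

# Both pipelines register their own hook on the *same* module:
vae_stub.register_forward_pre_hook(make_offload_hook("txt2img"))
vae_stub.register_forward_pre_hook(make_offload_hook("img2img"))

vae_stub(torch.randn(1, 4))
print(moves)  # ['txt2img', 'img2img'] — two hooks fire on a single forward pass
```

Because `pipeline.components` shares module objects rather than copying them, anything registered on one pipeline's VAE is also active when the other pipeline uses it.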
Cause 4: Scheduler Mismatch Between txt2img and img2img Pipelines (High)
Impact
DPM++ 2M Karras uses a Karras sigma schedule that spaces steps densely near the low-noise end of the trajectory, producing detail-rich images even with fewer steps.
PNDM and DDIM use uniform step spacing and tend to smooth out details, especially during img2img refinement at low denoising strength (0.3–0.4).
Result: the precision details created by DPM++ 2M Karras were being smeared by PNDM img2img. This was the direct cause of the oil painting artifact worsening with Hires.fix.
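The uneven step placement falls straight out of the Karras schedule formula (sigma_min/sigma_max below are typical SD 1.5 values, used only for illustration):

```python
# Karras et al. (2022) sigma schedule; rho=7 packs steps toward sigma_min,
# the low-noise region where fine detail is resolved.
def karras_sigmas(n: int, sigma_min: float = 0.03, sigma_max: float = 14.6,
                  rho: float = 7.0):
    max_r, min_r = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(max_r + i / (n - 1) * (min_r - max_r)) ** rho for i in range(n)]

sigmas = karras_sigmas(10)
print(round(sigmas[0], 2), round(sigmas[-1], 3))        # 14.6 0.03
# Step sizes shrink toward the low-noise end:
print(sigmas[0] - sigmas[1] > sigmas[-2] - sigmas[-1])  # True
```

A uniform schedule over the same range spends far fewer of its steps in that final low-noise stretch, which is exactly where PNDM-style samplers give up detail.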
Fix
```python
def setScheduler(self):
    schedulerType = self.args['scheduler']
    if schedulerType == 'DPM++ 2M Karras':
        scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipeline.scheduler.config,
            use_karras_sigmas=True,
            algorithm_type="dpmsolver++",
            final_sigmas_type="sigma_min",  # config is frozen after init, so pass it here
        )
    # ... other scheduler branches ...
    # Key: apply the same scheduler to both txt2img and img2img!
    self.pipeline.scheduler = scheduler
    self.imgPipeline.scheduler = scheduler
```
Cause 5: BILINEAR Resize After RealESRGAN (Medium)
Discovery Process
Comparing LANCZOS and RealESRGAN approaches in Hires.fix:
```python
# LANCZOS approach: high-quality resize ✅
upscaled_image = image.resize((hires_width, hires_height), Image.LANCZOS)

# RealESRGAN approach: why BILINEAR? ❌
sr_image = self.upscaleModel.predict(image)
upscaled_image = sr_image.resize((hires_width, hires_height), Image.BILINEAR)
```
After 4x upscaling with RealESRGAN, Image.BILINEAR (bilinear interpolation) was used to resize back down to the 2x target resolution. BILINEAR linearly blends neighboring pixels, which blurs edges and fine detail.
RealESRGAN's hard-won detail restoration was being smeared again by the BILINEAR resize.
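Fix
The fix follows the LANCZOS path already used in the other branch: resize the RealESRGAN output with Image.LANCZOS instead of BILINEAR. A minimal Pillow sketch (sizes assume a 512×512 base, 4x RealESRGAN output, and a 2x Hires.fix target):

```python
from PIL import Image

# Stand-in for the RealESRGAN 4x output (2048x2048 from a 512x512 input):
sr_image = Image.new("RGB", (2048, 2048))

# Downscale to the 2x Hires.fix target with LANCZOS, which preserves edges
# far better than BILINEAR's neighbor averaging:
hires_width, hires_height = 1024, 1024
upscaled_image = sr_image.resize((hires_width, hires_height), Image.LANCZOS)
print(upscaled_image.size)  # (1024, 1024)
```

LANCZOS is a windowed sinc filter, so it acts as a proper low-pass resampler during downscaling instead of averaging pixel pairs the way BILINEAR does.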