During training, the data loader picks one bucket and pulls out enough images to fill the batch size (usually 4–8 images). Since they're from the same bucket, they're all the same size, and torch.stack() succeeds without issues. Latent representations after VAE encoding also match in size, and the UNet processes them normally.
This method is not optional. If you want to train Stable Diffusion with non-square images, you must use this technique. It's built into all major training tools: Kohya_ss, NovelAI, Diffusers, SDXL fine-tuning scripts, and more.
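To make the mechanism concrete, here's a toy sketch of bucketed batch sampling in plain Python. The function names are mine, not Kohya_ss's actual code; it only illustrates the core invariant that a batch is drawn from a single bucket:

```python
import random
from collections import defaultdict

# Hypothetical sketch: group images by their (already bucketed)
# resolution, then draw a whole batch from one bucket so every
# sample in the batch shares the same shape.
def make_buckets(image_sizes):
    buckets = defaultdict(list)
    for idx, (w, h) in enumerate(image_sizes):
        buckets[(w, h)].append(idx)
    return buckets

def sample_batch(buckets, batch_size, rng=random):
    # only buckets with enough images can fill a batch on their own
    eligible = [k for k, v in buckets.items() if len(v) >= batch_size]
    key = rng.choice(eligible)
    return key, rng.sample(buckets[key], batch_size)

sizes = [(512, 512)] * 6 + [(640, 448)] * 5 + [(448, 640)] * 2
buckets = make_buckets(sizes)
res, batch = sample_batch(buckets, batch_size=4, rng=random.Random(0))
# every index in `batch` refers to an image of resolution `res`,
# so torch.stack() on their tensors would succeed
```

Note that the (448, 640) bucket with only 2 images can never fill a batch of 4 on its own, which is exactly the small-bucket problem discussed below.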

How Many Buckets to Use
This is where it gets tricky when actually running training. Too few buckets and you can't accommodate images with diverse aspect ratios. Too many and the number of images per bucket drops, making it hard to fill batches. Smaller batch sizes lead to unstable gradients and lower training quality.
Looking at Kohya_ss's default settings, bucket resolution intervals are set in 64-pixel increments, which produces roughly 20–30 buckets. This works well when you have thousands of training images, but for small-scale LoRA training with fewer than 100 images, splitting buckets too finely can leave only 2–3 images per bucket. At that point, you either reduce the number of buckets or lower the batch size to 1–2, but the latter affects training stability.
In my case, when training character LoRAs I typically use 20–50 images. From experience, widening the bucket resolution interval to 128 cuts the number of buckets in half while increasing images per bucket, making training more stable. Of course, this comes at the cost of a slightly larger gap between original aspect ratios and actual training resolutions, but when image counts are low, it seems like the better compromise.
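The interval-versus-bucket-count trade-off is easy to see with a toy example. This is a simplified sketch (real trainers resize images to fit a target area while preserving aspect ratio; here I just snap each side down to the nearest multiple of the step):

```python
from collections import Counter

# Hypothetical sketch: snap each image to a bucket resolution for a
# given step size, then count how many images land in each bucket.
def bucket_key(w, h, step):
    # round each side down to the nearest multiple of `step`
    return (max(step, w // step * step), max(step, h // step * step))

def bucket_counts(sizes, step):
    return Counter(bucket_key(w, h, step) for w, h in sizes)

sizes = [(832, 1216), (840, 1200), (896, 1152), (768, 1280), (800, 1248)]
print(bucket_counts(sizes, step=64))   # 5 buckets of 1 image each
print(bucket_counts(sizes, step=128))  # 3 buckets, the largest holding 3
```

With a 64-pixel step these five portrait images scatter into five single-image buckets; widening the step to 128 merges three of them into one bucket that can actually fill a batch.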
SDXL's Pixel Budget System
This is an improvement introduced in SDXL, and it's quite clever.
Instead of fixed resolutions, SDXL uses the concept of a "target pixel count (pixel budget)." With a target of roughly 1024² ≈ 1 million pixels, it calculates the resolution that fits each aspect ratio. Two constraints apply:
- Both width and height must be divisible by 64
- The total pixel count must be close to the target
The reason they must be divisible by 64 is the VAE. Stable Diffusion's VAE downsamples images by 8×, so sides divisible by 8 are the minimum requirement for an integer-sized latent representation; aligning to 64-pixel increments goes further and keeps the latent dimensions themselves divisible by 8, which the UNet's own downsampling stages need. If the height were 830, then 830÷8 = 103.75, which can't produce an integer-sized latent representation, breaking the bucketing entirely.
Taking a 3:2 ratio as an example, a naive calculation gives 1536×1024 (about 1.57 million pixels), which far exceeds the 1 million pixel target. Instead, 1216×832 is used. 1216÷64 = 19 and 832÷64 = 13, so both divide evenly, and the latent representation becomes 152×104, processing cleanly. The total pixel count is about 1.01 million, close to the target.
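The search itself is straightforward to sketch. This is my own illustrative version of the pixel-budget idea, not SDXL's exact algorithm: for each candidate width, pair it with the largest 64-multiple height that stays within the budget, then keep the pair whose aspect ratio is closest to the requested one.

```python
# Sketch of a pixel-budget search (illustrative, not SDXL's exact code):
# both sides must be multiples of 64, the area must not exceed the
# target of ~1024*1024 pixels, and the aspect ratio should be as
# close as possible to the requested one.
def budget_resolution(aspect, target=1024 * 1024, step=64):
    best = None
    for w in range(step, 2048 + step, step):
        # largest multiple-of-64 height that keeps w*h within budget
        h = target // w // step * step
        if h < step:
            continue
        dev = abs(w / h - aspect)
        if best is None or dev < best[0]:
            best = (dev, w, h)
    return best[1], best[2]

w, h = budget_resolution(3 / 2)  # (1216, 832), latent 152x104
```

Running it for a 3:2 ratio reproduces the 1216×832 bucket from the example above, and a 1:1 request gives back 1024×1024.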
Thanks to this system, SDXL can train on images with diverse aspect ratios while maintaining tensor consistency in latent space.

The Pre-Bucketing Era – Why SD 1.x Outputs Had So Many Chopped Heads
Anyone who's been doing AI image generation since the early days will relate to this.
In the SD 1.x and 2.x era, there was no bucketing – all training images were force-cropped to 512×512. A 2000×1000 wide landscape photo had its sides chopped off, and tall full-body illustrations had their tops and bottoms cut. Heads getting sliced off in portraits and roofs disappearing from building photos was an everyday occurrence.
The problem wasn't just that it "looked bad." When a model trains on cropped images, it learns that "it's normal for a person's head to be cut off at the top of the frame." The phenomenon where generated portraits have heads extending beyond the frame, or you ask for a full body and the feet get cut off – that came from here. The model wasn't bad; the training data was already broken.
Among the images posted on aickyway, if you look at SD 1.5-based outputs, you'll often see full-body characters with cut-off feet or unnatural spacing above the head. The dramatic reduction of this problem in SDXL and later models isn't just due to model size or parameter count – it's largely because bucketing preserved the original composition in training data.
This was a change closer to "bug fix" than "performance improvement."

Changes in DiT Architecture – Inference Is Flexible, but Training Is Still Fixed
Recent models like Stable Diffusion 3, Hunyuan-DiT, and Wan use DiT (Diffusion Transformer) architecture instead of UNet. One important feature of DiT is 2D RoPE (Rotary Position Embedding), which assigns 2D positional information to each token in latent space. This allows a 128×64 grid and a 152×104 grid to be interpreted as different spatial structures, enabling inference at resolutions never seen during training.
For example, a model trained only with 1024×512 and 512×512 buckets can generate 2048×1024 images during inference. This is because 2D RoPE dynamically adjusts position encodings to match the new token count.
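To show why nothing breaks when the grid grows, here's a deliberately minimal sketch of the 2D rotary idea (my own toy version, not any model's actual implementation): each latent token gets its (row, col) coordinate, and each coordinate axis is turned into rotation angles across a few frequency bands.

```python
import math

# Toy sketch of 2D rotary positions (illustrative only): half of the
# per-token angle vector encodes the row coordinate, the other half
# the column coordinate, each across a small set of frequencies.
def rope_2d_angles(rows, cols, dim=8, base=10000.0):
    half = dim // 2
    freqs = [base ** (-2 * i / half) for i in range(half // 2)]
    angles = []
    for r in range(rows):
        for c in range(cols):
            # concatenate row-axis and column-axis angles per token
            angles.append([r * f for f in freqs] + [c * f for f in freqs])
    return angles  # one angle vector per token, rows * cols tokens total

# a 4x8 grid and an 8x16 grid just produce longer coordinate lists;
# nothing in the encoding is tied to a fixed training resolution
small = rope_2d_angles(4, 8)
large = rope_2d_angles(8, 16)
```

The point of the sketch: the angle computation depends only on each token's own coordinates, so a larger grid simply extends the same scheme rather than requiring a new learned position table.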
However, this flexibility is limited to inference only. During training, bucketing is still used. Gradient computation, attention caching, and memory alignment all require identical tensor sizes within a batch. Just because DiT can structurally handle variable-length sequences doesn't mean PyTorch's batch constraints disappear during training.
Without understanding this distinction, you'll fall into misconceptions like "DiT models can handle any resolution, so we don't need bucketing, right?" Being flexible at inference and being flexible at training are separate issues.
Practical Tips for Bucketing Settings
From here on, this content isn't from the original article – it's from my own experience with LoRA training.
Check aspect ratios during the image preprocessing stage. If extreme ratios (e.g., 4:1 panoramas) are mixed into your training image folder, those buckets will only have 1–2 images, making batch construction inefficient. It's better to exclude such images from training or crop them appropriately before including them.
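A quick pre-flight check like the following is how I'd catch these before training. The function and the 2.5 threshold are my own choices, not a standard value:

```python
# Hypothetical pre-flight check: flag images whose aspect ratio is so
# extreme they would sit nearly alone in a bucket. The 2.5 cutoff is
# an arbitrary choice; tune it to your own dataset.
def flag_extreme_ratios(sizes, max_ratio=2.5):
    flagged = []
    for name, (w, h) in sizes.items():
        ratio = max(w, h) / min(w, h)
        if ratio > max_ratio:
            flagged.append(name)
    return flagged

dataset = {
    "portrait.png": (832, 1216),
    "panorama.png": (2048, 512),   # 4:1, would be a bucket of one
    "square.png": (1024, 1024),
}
flag_extreme_ratios(dataset)  # ["panorama.png"]
```

Anything flagged either gets cropped toward a more common ratio or dropped from the training set.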
In Kohya_ss, it's best to enable the --bucket_no_upscale option. Without it, images smaller than their assigned bucket resolution get upscaled to fit, but upscaling a low-resolution image doesn't add information – it just adds blur. This option makes a noticeable difference, especially when training image quality is uneven.
Regularization images must follow the same bucket structure. If you feed training images with diverse ratios but use only square regularization images, regularization weakens at certain ratios, causing overfitting patterns to vary by aspect ratio. Ideally, the regularization set should have an aspect-ratio distribution similar to the training set's.
Conclusion
What struck me while putting this article together is that bucketing is ultimately "bypassing framework constraints in the data pipeline, not overcoming model limitations."
PyTorch's requirement for uniform tensor sizes hasn't changed, and the VAE's 8× downsampling structure remains the same. What bucketing does is enable the use of images with diverse aspect ratios for training within those constraints, and SDXL's pixel budget system refined that further.
For anyone involved in AI image generation who's interested in LoRA training, bucketing isn't "nice to know" – it's "your results will be broken without it." Prepare training images with diverse aspect ratios, but adjust bucket settings to match your image count and ratio distribution. When image counts are low, widen the bucket intervals and exclude extreme ratios to reduce the chance of failure.