My first time running LoRA training was last year, and this was the very first problem I ran into. I had prepared my training images — tall full-body illustrations, wide landscape images, and square face close-ups — dumped them all into one folder, and hit run. An error popped up immediately.

RuntimeError: expected all tensors to be of the same size

At the time, I just thought "Oh, I just need to resize everything to 512×512" and moved on. But when I looked at the results, the full-body character's head was cut off and the sides of the landscape images were all smashed together. I didn't understand exactly why at the time. It wasn't until I learned about Aspect Ratio Bucketing that I properly understood both the cause and the solution.

Recently, a researcher named hengtao tantai posted a clean writeup on this topic on Towards AI (original: "How Stable Diffusion Trains Variable-Resolution Images Without PyTorch Errors", 2024.12.13). Since this is essential knowledge for anyone doing LoRA or DreamBooth training in the image generation community, I've put together this explanation, mixing his article with my own experience.

[Image: a folder containing training images of various aspect ratios — portrait, landscape, and square — with an error message displayed]


Why PyTorch Rejects Images of Different Sizes

This part is simple once you understand it, but you'll keep going in circles if you don't.

In PyTorch, a batch is multiple images bundled into a single tensor (a multi-dimensional numerical array). Images are stacked using torch.stack(), and at that point, all images must have exactly the same height, width, and number of channels. If you try to put a 512×512 image and a 1024×512 image in the same batch, PyTorch refuses. Parallel computation simply doesn't work when array sizes differ.
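You can see the rule for yourself in a few lines. This sketch uses NumPy's stack, which enforces the same "all shapes must match" requirement as torch.stack, so it runs without a PyTorch install:

```python
import numpy as np

# Two fake "images" in channels-first layout: 3x512x512 and 3x512x1024
square = np.zeros((3, 512, 512), dtype=np.float32)
wide = np.zeros((3, 512, 1024), dtype=np.float32)

# Identically shaped images stack cleanly into one batch tensor
batch = np.stack([square, square])
print(batch.shape)  # (2, 3, 512, 512)

# Mixed shapes cannot be fused into a single array
try:
    np.stack([square, wide])
except ValueError as err:
    print("stack failed:", err)
```

Swap in torch.stack on real tensors and you get the same failure, just phrased as the RuntimeError above.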

This isn't just a problem with the original images. Stable Diffusion doesn't train on images directly — it first compresses them through a VAE (Variational Autoencoder) into latent space representations. A 512×512 image becomes a 64×64 latent representation, while 1024×512 becomes 128×64. These latent representations also need to be the same size for the UNet to process them. Kernel operations, attention calculations — they all assume identical tensor sizes within a batch.
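The latent sizes above follow from the VAE's 8× spatial downsampling (with 4 latent channels in place of RGB's 3). A minimal sketch of the arithmetic, with a hypothetical helper name latent_shape:

```python
# Stable Diffusion's VAE downsamples each spatial dimension by a factor of 8
# and encodes into 4 latent channels.
VAE_SCALE = 8
LATENT_CHANNELS = 4

def latent_shape(height, width):
    """Return the (channels, height, width) of the VAE latent for a pixel size."""
    return (LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)

print(latent_shape(512, 512))   # (4, 64, 64)
print(latent_shape(512, 1024))  # (4, 64, 128)
```

So mismatched pixel sizes stay mismatched after encoding; the VAE shrinks the problem but doesn't remove it.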

You might think padding (filling empty space with zeros) could solve this, but it doesn't work during training. The padding regions contaminate gradient calculations. If the model learns that "black borders are a normal part of images," artifacts with meaningless black regions start appearing in generated results.
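To get a feel for how severe the contamination is, consider padding a 768×512 portrait crop up to a 768×768 square: a full third of every such training sample would be zeros. A quick back-of-the-envelope check (padding_fraction is an illustrative helper, not a library function):

```python
def padding_fraction(height, width, target_height, target_width):
    """Fraction of the padded canvas occupied by zero padding."""
    image_area = height * width
    canvas_area = target_height * target_width
    return 1 - image_area / canvas_area

# Zero-padding a 768x512 portrait crop out to a 768x768 square:
frac = padding_fraction(768, 512, 768, 768)
print(f"{frac:.0%} of the tensor is zero padding")  # 33% of the tensor is zero padding
```

With that much of every batch being constant black, the model has no way to tell padding apart from real image content, and it learns the borders as if they were data.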

That's why Aspect Ratio Bucketing exists.