Running AI image generation content on aickyway, I get questions about prompt writing nonstop. But most of them only cover text prompts: things like "use these keywords" or "write your negative prompt like this." There was a time when that was enough, but after starting to use GPT-4o and Gemini in real work, I realized that using only text taps into just a fraction of what these models can do.

A while ago, while working on the aickyway landing page redesign, I tried getting design feedback from GPT-4o. At first, I just asked in text: "This is a landing page for an AI image generation community site. How should I modify it to increase conversion rates?" The response was textbook UX advice. Things like "make the CTA button stand out" or "add user testimonials." But when I uploaded a screenshot along with it and asked "what element do you think users' eyes are drawn to first on this page?", the specificity of the response completely changed. It pointed out that the visual weight of the hero image was suppressing the CTA button, and that the text contrast ratio on the navigation bar was low, analyses that could never have come from text alone.

After this experience, I became interested in "how to compose prompts by combining text + images + video," and here's my summary.

[Image: An AI chat screen where a website screenshot and text prompt are entered together, with the AI analyzing and annotating UI elements]


Uploading an Image Isn't Multimodal

Let me start by addressing the part most people get wrong.

Uploading an image alongside your prompt is just the starting point. The key is structuring each input to serve a different role. Text guides the direction of image analysis, images provide context for video analysis, and information from video becomes the basis for the next text prompt.
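As a concrete illustration of "each input serving a role," here's a minimal sketch in the OpenAI-style chat message format, where the text portion explicitly frames what the image is and what to do with it. The model name is omitted and the URL is a placeholder; this builds the request payload only.

```python
# Sketch of a role-assigning multimodal prompt in the OpenAI-style
# content-array format. The image URL is a placeholder.
def build_prompt(instruction: str, image_url: str) -> list:
    """Pair an image with text that tells the model what role the image plays."""
    return [
        {
            "role": "user",
            "content": [
                # Text sets the analytic frame before the image arrives.
                {"type": "text", "text": instruction},
                # The image supplies the context the text refers to.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_prompt(
    "This screenshot is our current landing page. "
    "Which element draws the eye first, and why?",
    "https://example.com/landing.png",
)
```

The point is the ordering: the instruction arrives before the image, so the model reads the pixels through the frame you set, rather than producing a generic description first.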

You'll feel the difference immediately when you try it yourself. If you just toss in a single image saying "analyze this design," you get a generic analysis. But if you say "this is a competitor's homepage (Image A), this is our homepage (Image B), this is heatmap data (Image C). Where in B is user drop-off higher than A, and does C support that?", you get a completely different level of response. That's the difference between "stacking" modalities versus just "throwing them in."
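The "stacking" pattern above can be sketched as interleaved text and images in one content array, so each image gets a label the final question can refer back to. This follows the same OpenAI-style message schema; the URLs are placeholders.

```python
# Sketch of "stacking" modalities: each image is labeled in text before it
# appears, and the final question references those labels by name.
def stacked_comparison(images: dict, question: str) -> list:
    content = []
    for label, url in images.items():
        # Label first, image second, so the model can bind "Image A" to pixels.
        content.append({"type": "text", "text": f"This is {label}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = stacked_comparison(
    {
        "Image A (competitor homepage)": "https://example.com/a.png",
        "Image B (our homepage)": "https://example.com/b.png",
        "Image C (heatmap data)": "https://example.com/c.png",
    },
    "Where in B is user drop-off likely higher than in A, and does C support that?",
)
```

Without the labels, the model has to guess which image is which; with them, the question can cross-reference all three in a single pass.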


Model Architecture Affects Prompt Strategy

I won't go deep into architecture. I covered the vision encoder → projector → LLM structure in another post about multimodal LLMs, so here I'll only discuss the differences that directly impact prompt design.

Unified models like Gemini, which process all inputs in a single model from the start, can connect image content and questions fairly well even if you just upload an image and ask "what's wrong?" Modular architectures, on the other hand, where specialized modules process each modality and merge the results later, sometimes produce better results when you explicitly point out cross-modal connections, like "look at the top 20% of this image; the navigation bar alignment with the text below seems broken."
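That difference can be captured in how you phrase the same question per architecture. A minimal sketch, where the wording templates are my own illustration rather than any official API:

```python
# Sketch: the same question, phrased two ways depending on architecture.
# The "explicit spatial anchor" wording is illustrative, not an official API.
def frame_question(question: str, region=None) -> str:
    if region:
        # Modular pipelines often benefit from an explicit spatial anchor
        # that ties the text to a specific part of the image.
        return f"Look at {region} of this image. {question}"
    # Unified models can usually connect image and question on their own.
    return question

terse = frame_question("What's wrong?")
explicit = frame_question(
    "Does the navigation bar alignment match the text below it?",
    region="the top 20%",
)
```

In practice I default to the explicit version: it costs nothing with unified models and noticeably helps with modular ones.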