Training Methodologies
This section is based on the original article and papers rather than my own implementation experience.
There are four main ways multimodal models learn the "relationship between images and text."
Contrastive Learning – The method used by CLIP. Matching image-text pairs are trained to be close in embedding space, while mismatched pairs are pushed apart. The loss function uses InfoNCE loss, which maximizes the similarity of correct pairs among all image-text combinations within a batch. Larger batch sizes (= more comparison targets) lead to better training, which is why the original CLIP paper used a batch size of 32,768. Reproducing this number as an individual is practically impossible, which is why most people use pre-trained CLIP weights.
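To make the "diagonal pairs attract, everything else repels" idea concrete, here is a minimal numpy sketch of a symmetric InfoNCE loss. This is my own toy illustration, not CLIP's training code; each row (and column) of the batch similarity matrix is treated as a classification problem whose correct answer is the diagonal entry.

```python
import numpy as np

def clip_info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matching image/text embedding pairs."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); correct pairs on the diagonal
    diag = np.arange(len(logits))
    # Row-wise log-softmax: each image must pick its own text out of the batch
    i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Column direction: each text must pick its own image
    t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-i2t[diag, diag].mean() - t2i[diag, diag].mean()) / 2
```

You can see why batch size matters directly in the code: the softmax denominator is the number of in-batch negatives, so a bigger batch makes the classification problem harder and the learned embeddings sharper.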
Masked Modeling – An extension of BERT's text word masking to image regions. The model learns cross-modal relationships by reconstructing masked portions.
VQA Pre-training – Direct training with question-answer pairs about images. Focused on building reasoning ability rather than matching.
Instruction Tuning – The fine-tuning stage where the model learns to follow user instructions like "summarize this chart" or "find something unusual in the photo." Without this process, the model can only do basic captioning and cannot function as an interactive assistant.
One note about training data scale: Qwen2-VL used billions of image-text tokens, and this level of computing cost is beyond what individuals or small teams can afford. This is why "using" multimodal models rather than "building" them is the realistic choice.
Video Is a Different Problem
I'll keep this section brief. I have no hands-on experience with video multimodal models, so I'll just convey the main points from the original article.
An image is a single snapshot, while video adds a temporal dimension. At 30fps, one minute equals 1,800 frames, and since processing all of them is impossible, a key frame extraction strategy is needed, along with solving the temporal misalignment problem between narration and scenes.
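The frame-budget arithmetic can be sketched as uniform sampling. Real systems use smarter strategies (scene-change detection, motion scoring), but the budget math is the same; this toy version just spreads a fixed number of key frames evenly across the clip.

```python
def keyframe_indices(duration_s, fps=30, budget=32):
    """Evenly spaced key-frame indices: 60 s at 30 fps is 1,800 frames,
    but only `budget` of them are actually sent to the model."""
    total = int(duration_s * fps)
    if total <= budget:
        return list(range(total))    # short clip: keep every frame
    step = total / budget
    return [int(i * step) for i in range(budget)]
```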
Currently, text+image multimodal is already at commercial quality with Gemini and GPT-4o, but the original article's assessment – and I agree – is that most video understanding models are still closer to the research stage.
Multimodal RAG – This Is Where the Real Money Is
This section will be the longest, and for good reason. We're evaluating a feature for aickyway's next update that would automatically link related technical documents to user-uploaded images, and I did a deep dive into multimodal RAG during that process.
RAG itself is "an approach where AI searches for and references relevant materials before generating an answer." Instead of cramming all knowledge into model parameters, it retrieves necessary evidence from external sources at inference time. Multimodal RAG extends the search targets beyond text to include images, diagrams, and video frames.
The flow works like this:
- Collect text, images, video key frames, and metadata
- Convert each into vectors and store them in a vector DB (Weaviate, Milvus, Pinecone, etc.)
- When a user query comes in, search for relevant materials regardless of modality
- Pass search results to a multimodal LLM for evidence-based answer generation
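The steps above can be sketched with an in-memory stand-in for the vector DB. The `embed` function here is a placeholder (a deterministic pseudo-random vector, stable only within one process), not a real embedding model, so the demo only shows the store/search plumbing, not semantic matching.

```python
import numpy as np

def embed(content, dim=64):
    """Placeholder for a real embedding model (e.g. CLIP): a deterministic
    pseudo-random unit vector per input, stable within one process."""
    rng = np.random.default_rng(abs(hash(content)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class TinyVectorStore:
    """In-memory stand-in for Weaviate/Milvus/Pinecone."""
    def __init__(self):
        self.items = []                       # (vector, payload) pairs

    def add(self, content, payload):
        self.items.append((embed(content), payload))

    def search(self, query, k=3):
        """Cosine similarity ranking, regardless of payload modality."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: -float(item[0] @ q))
        return [payload for _, payload in ranked[:k]]
```

The payloads carry modality metadata, which is what lets step 4 hand a mixed bag of text chunks and image references to the multimodal LLM.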
It sounds simple on paper, but in practice, you get stuck at step 2. Cross-modal retrieval requires placing text and images in the same embedding space, which isn't as easy as it sounds. CLIP embeddings work reasonably well for image-text matching, but precision drops significantly with long documents or technical diagrams.
What we learned from testing is that rather than forcing image and text embeddings into the same space, a "two-stage approach" where you first generate captions or descriptions for images and then search based on that text is more stable at this point. It's a workaround compared to pure cross-modal retrieval, but it's a practical compromise given current embedding model limitations.
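The two-stage idea is simple enough to show in a few lines. `caption_model` stands in for a multimodal LLM captioner (any callable works for the demo), and retrieval here is crude keyword overlap; a production version would use a proper text embedding model for stage 2, but the shape is the same: images never enter the search space directly, only their descriptions do.

```python
def two_stage_index(image_paths, caption_model):
    """Stage 1: run each image through a captioner (a multimodal LLM in
    practice; any callable here) and index the resulting text."""
    return {caption_model(path): path for path in image_paths}

def two_stage_search(index, query):
    """Stage 2: plain text retrieval over the captions. Keyword overlap is
    used for the demo; a real system would use a text embedding model."""
    q_words = set(query.lower().split())
    best = max(index, key=lambda cap: len(q_words & set(cap.lower().split())))
    return index[best]
```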
The reason multimodal RAG is better than "stuffing everything into the prompt" is clear. Context windows have limits (even 128K tokens fills up with just a few high-resolution images), and there's the "Lost in the Middle" phenomenon where models miss information in the middle of long contexts. The RAG approach of precisely extracting and passing only the necessary evidence is superior in terms of quality.

Practical Use Cases
I'll keep this brief. Here are the areas where multimodal LLMs actually work in practice right now:
Automatic document, form, and table interpretation – This has the biggest impact. There used to be a pipeline of extracting text with OCR (technology for extracting characters from images) and then parsing it with regex, and anyone who's used Korean OCR knows how terrible the accuracy is. Multimodal models can interpret a receipt or form directly when fed the whole image. Considering the time spent post-processing broken OCR results, the productivity difference is significant.
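In practice, "feed the whole image" means sending the document image inline with a question. Here's a sketch of the message shape most OpenAI-compatible vision endpoints accept (a content list mixing text and a base64 data URL); `receipt_message` is my own helper name, and the media type is assumed to be PNG.

```python
import base64

def receipt_message(image_bytes, question):
    """Chat message pairing a question with an inline base64 image, in the
    content-list format accepted by OpenAI-compatible vision endpoints."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```

The whole regex-parsing stage collapses into the question itself, e.g. "return the vendor, date, and total as JSON."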
VQA and visual reasoning – Chart analysis, diagram interpretation, identifying object relationships in images. If you've used Gemini Pro Vision or GPT-4o, you've already experienced this.
Industrial applications – Combining medical imaging with patient records for diagnostic assistance, automatic defect classification on manufacturing lines, etc. However, accuracy requirements are high in these areas, so standalone use without human review is still rare.
Deployment – This Is Where I Struggled the Most
I'm running the aickyway backend on a single RTX 4090 24GB GPU server, and here's what I experienced while testing multimodal model serving.
When I first loaded LLaVA-1.5 7B, text-only inference was stable at around 14GB VRAM. But once I started feeding images, each image added 576 tokens from the vision encoder output, and with just 2-3 concurrent requests, VRAM hit the 24GB ceiling and everything crashed with OOM (Out of Memory) errors. Lowering image resolution helps to some extent, but then fine-grained recognition quality drops, defeating the purpose.
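The 14GB baseline is mostly model weights; what images add is KV cache. A back-of-envelope calculation, assuming LLaMA-7B-class dimensions (32 layers, 32 attention heads, head dim 128, fp16, no grouped-query attention), shows why a few image-heavy requests blow past 24GB:

```python
def kv_cache_bytes(tokens, layers=32, heads=32, head_dim=128, bytes_per_val=2):
    """Per-request KV cache: 2 (K and V) * layers * heads * head_dim bytes
    stored for every token. Defaults are LLaMA-7B-class dimensions in fp16."""
    return tokens * 2 * layers * heads * head_dim * bytes_per_val

per_image_mib = kv_cache_bytes(576) / 2**20   # one 576-token image ≈ 288 MiB
```

At roughly 0.5 MiB of KV cache per token, each image costs about 288 MiB before a single text token is processed, so a handful of multi-image requests eats the ~10GB of headroom quickly.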
Batching was also a problem. Text-only models have relatively uniform input lengths, making batching straightforward, but multimodal requests have extreme token count variations depending on whether images are included. When an image-free request (200 tokens) and a 3-image request (2,000+ tokens) end up in the same batch, the short request has to wait until the long one finishes. Naive static batching simply cannot solve this problem.
That's when we adopted vLLM, and the difference was clear.
What Exactly Does vLLM Solve?
vLLM is an open-source LLM inference engine from UC Berkeley's Sky Computing Lab. It has several key technologies that directly address the problems described above.
PagedAttention – The core solution for VRAM OOM issues. The biggest memory consumer in LLM inference is the KV cache (memory storing attention keys/values from previous tokens), and the conventional approach pre-allocates memory for the maximum sequence length. This means even a 100-token request consumes memory for 2,048 tokens. PagedAttention allocates memory in page units (typically 16 tokens) on demand, similar to OS virtual memory. The original paper reported up to 97% reduction in memory waste, and in practice, the number of concurrent requests that could be handled on the same GPU roughly doubled or tripled.
Continuous Batching – The technology that solves the batching problem mentioned above. When one request in a batch finishes, a new request is immediately placed in the empty slot. Short image-free requests no longer have to wait for 3-image requests to complete. The throughput improvement is noticeable.
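A toy scheduler makes the difference from static batching visible. This is my own simplified model (each request is just a name and a number of decode steps; one step per tick): the point is that a finished request's slot is refilled on the very next tick instead of waiting for the whole batch to drain.

```python
def continuous_batching(requests, slots=2):
    """Toy scheduler: `requests` is a queue of (name, decode_steps). Each
    tick every running request advances one step; a finished request's slot
    is refilled immediately rather than at batch boundaries."""
    waiting, running, done, t = list(requests), [], [], 0
    while running or waiting:
        while waiting and len(running) < slots:   # fill empty slots first
            name, steps = waiting.pop(0)
            running.append([name, steps])
        t += 1
        for r in running:
            r[1] -= 1
        for r in [r for r in running if r[1] == 0]:
            running.remove(r)
            done.append((r[0], t))                # record completion tick
    return done
```

With static batching, both short requests would sit behind the 10-step request; here the second short request starts as soon as the first one frees a slot.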
Speculative Decoding – A small draft model predicts multiple tokens ahead, and the main model verifies them all at once. If predictions are correct, multiple tokens are processed in a single step, increasing speed; if wrong, it restarts from that point. This is effective for text generation speed, but for multimodal models, image preprocessing time is often the bottleneck, so the perceived effect varies by situation.
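The accept/reject logic can be sketched for the greedy case (real systems use probabilistic acceptance against the target's distribution, which this simplification skips): the target accepts the draft's tokens as long as it agrees, then emits one token of its own, all in a single verify pass.

```python
def speculative_step(draft_tokens, target_next):
    """One verify pass: accept the draft's tokens while the (greedy) target
    model agrees, then let the target emit one token of its own."""
    accepted = []
    for tok in draft_tokens:
        if target_next(accepted) != tok:   # first disagreement: discard the rest
            break
        accepted.append(tok)
    accepted.append(target_next(accepted)) # target always contributes one token
    return accepted
```

If the draft guesses five tokens correctly, six tokens come out of one target forward pass, which is the entire speedup.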
OpenAI-Compatible API – You can run code originally written for the OpenAI API almost unchanged by just switching the endpoint URL. This significantly reduces migration costs.
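Concretely, "compatible" means the server speaks the same `/chat/completions` request schema. The sketch below builds that POST with only the standard library (the model name and localhost port are assumptions; match them to whatever your vLLM server loaded):

```python
import json
import urllib.request

def vllm_chat_request(prompt, model="llava-hf/llava-1.5-7b-hf",
                      base_url="http://localhost:8000/v1"):
    """Build the same POST an OpenAI SDK would send, aimed at a local vLLM
    server. vLLM ignores the bearer token unless started with --api-key."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer not-needed"})

# urllib.request.urlopen(vllm_chat_request("hi")) returns OpenAI-schema JSON;
# the official openai client works the same way once base_url points here.
```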
vLLM has recently been expanding support for vision-language models like LLaVA and Qwen2-VL, but stability is still lower compared to text-only models. In particular, image input format compatibility and preprocessing pipeline differences between multimodal models require considerable configuration adjustments when switching models. Still, it's a much better choice than writing your own serving code.

Risks
I'll keep this brief.
Privacy risks spike the moment you process images. Masking names or numbers in text is easy, but automatically detecting and blurring faces or ID cards in images requires a separate pipeline. For services like aickyway where users freely upload images, this needs to be reflected in the initial design. We're in the process of enhancing our automatic filtering logic for uploaded images.
Training data bias operates simultaneously on both text and image fronts, and the potential for deepfake misuse continues to grow alongside multimodal technology advancement. Using model outputs for final decision-making without human review in high-risk domains (medical, legal, financial) is still premature.
Closing Thoughts – The Outlook from aickyway's Perspective
The thought that came up most while writing this article was "how will this affect the AI image generation community?"
The current basic workflow for AI image generation is text → image, one-directional. You write a prompt, get results, tweak the prompt if unsatisfied, and regenerate. Technologies like IP-Adapter and ControlNet opened the door to using images as reference inputs, but these are still separate pipelines. You have to open a separate ControlNet tab in Stable Diffusion WebUI, select a preprocessing model, and upload images separately.
As multimodal LLMs mature, there's a high likelihood this process will be integrated into a conversational format: a natural language instruction like "keep this character's pose but change the background to cyberpunk," processed alongside image inputs in a single step. GPT-4o's image generation feature is already moving toward accepting mixed text+image inputs, and I believe it's only a matter of time before this reaches the local model ecosystem.
Realistically though, the computational resources needed to process multiple high-resolution images in real-time while simultaneously generating new ones are substantial. Serving this from a single 4090 is impractical – you'd need at least A100-class GPUs, which completely changes the cost structure. Whether using commercial APIs or self-hosted infrastructure, serving efficiency is likely to become the bottleneck of this ecosystem. That's why tools like vLLM are important.
Personally, I predict that within a year, workflows where you generate images from mixed image+text inputs and give natural language modification instructions on the results will become mainstream in local environments. The question isn't "is it possible" but "will it run on average users' GPUs" – and given the pace of quantization technology advancement, that's not an impossible prospect either.