One of the most frequent questions I get while running aickyway is "What does it actually mean when AI understands images?" If you've ever thrown a screenshot at Gemini and asked about an error, you know it doesn't just magically work. There's quite a complex architecture running behind the scenes.

I recently came across a blog post on QuarkAndCode that neatly covered this topic from architecture to deployment (original: "Multimodal LLMs Guide: Text, Image & Video, RAG Search & vLLM", 2024.12.31). I took some notes while reading, and I'll try to break it down based on what I know. The training theory section leans heavily on the original article since I've never implemented the papers myself, while the serving and RAG sections will be longer since I have hands-on experience with those.

[Image: a person working in front of dual monitors showing an AI chat interface with an uploaded image and server logs]


Quick Terminology

There are many English abbreviations throughout the article, so let me define them once here to keep things concise.

Modality: A type of data. Text, images, video, and audio are each a modality. A multimodal model processes two or more of these simultaneously.

Embedding: Data converted into numerical vectors that AI can compute with. Similar meanings are positioned close together in vector space.

VQA (Visual Question Answering): A task where the model answers questions about an image.

RAG (Retrieval-Augmented Generation): An approach where relevant materials are retrieved before generating an answer.

The rest will be explained when they first appear in the text.
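The "similar meanings are close together" idea behind embeddings is usually measured with cosine similarity. Here is a minimal sketch using made-up 3-dimensional vectors (real models use hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means similar direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.15]
invoice = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # much lower: unrelated concepts
```

Retrieval in RAG works on exactly this principle: embed the query, then find stored chunks whose vectors score highest against it.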


Internal Architecture of Multimodal LLMs

To cut to the chase, the architecture itself is simple. Vision Encoder → Projector → LLM. Three blocks.

The vision encoder converts an image into feature vectors, the projector transforms those vectors into token formats the LLM can digest, and the LLM combines them with text tokens for inference.
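The data flow can be sketched in a few lines of numpy. Everything here is illustrative: the dimensions are made up, and the "vision encoder" is a stub standing in for a real model such as a CLIP ViT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
NUM_PATCHES = 16   # the encoder splits the image into this many patches
VISION_DIM = 64    # vision encoder output dimension
LLM_DIM = 128      # LLM token-embedding dimension

def vision_encoder(image):
    """Stub for a real encoder: image -> one feature vector per patch."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# Projector: here a single linear layer mapping vision features
# into the LLM's token space.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def projector(patch_features):
    return patch_features @ W_proj

def build_llm_input(image, text_token_embeddings):
    """Project image patches and prepend them to the text token sequence."""
    image_tokens = projector(vision_encoder(image))
    return np.concatenate([image_tokens, text_token_embeddings], axis=0)

text_embeds = rng.standard_normal((5, LLM_DIM))  # 5 text tokens
llm_input = build_llm_input(image=None, text_token_embeddings=text_embeds)
print(llm_input.shape)  # (21, 128): 16 image tokens + 5 text tokens
```

The key point the sketch makes concrete: after the projector, image patches are just more tokens in the same sequence, so the LLM attends over image and text positions with no special-casing.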

The projector is surprisingly important here, and LLaVA is a good example. The original LLaVA (v1) used a simple single linear layer as its projector. When LLaVA-1.5 switched to a 2-layer MLP (Multi-Layer Perceptron), benchmark scores improved significantly. How well the projector "translates" image features into the language model's token space directly affects overall performance. In many cases, improving the projector is more cost-effective than changing the vision encoder itself.
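The difference between the two projector designs is small in code, which is part of why the LLaVA-1.5 result is striking. A sketch of both variants (dimensions are hypothetical; LLaVA-1.5 uses a GELU between the two linear layers):

```python
import numpy as np

rng = np.random.default_rng(0)
VISION_DIM, LLM_DIM = 64, 128  # hypothetical dimensions

# LLaVA v1 style: a single linear projection.
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def linear_projector(x):
    return x @ W

# LLaVA-1.5 style: a 2-layer MLP with a GELU nonlinearity in between.
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(x):
    return gelu(x @ W1) @ W2

patches = rng.standard_normal((16, VISION_DIM))
print(linear_projector(patches).shape)  # (16, 128)
print(mlp_projector(patches).shape)     # (16, 128)
```

Both map patch features into the LLM's token space with identical output shapes; the MLP simply gives the "translation" a nonlinearity and more capacity, which is where the benchmark gains came from.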