LLaVA models analyze an image by splitting it into a grid of patches, embedding each patch with a vision encoder, and then passing those embeddings to a language model alongside the text embeddings of your prompt.
Let’s see LLaVA in action.
First, you need to have Ollama installed and running. If you haven’t already, download it from ollama.com.
Next, pull a LLaVA model. There are a few versions available, but llava is a good starting point.
ollama pull llava
Now, let’s run an interactive session. You’ll need an image file to test with. Save an image as cat_on_couch.jpg in your current directory.
ollama run llava
Once the >>> prompt appears, you can send an image and a question. Type the following, making sure to replace /kaggle/input/cat_on_couch.jpg with the actual path to your image file:
/kaggle/input/cat_on_couch.jpg: What is the main subject of this image?
The model will then process the image and respond. You might see something like:
The main subject of the image is a cat sitting on a couch.
You can ask more specific questions:
/kaggle/input/cat_on_couch.jpg: Describe the cat's posture.
Response:
The cat is sitting upright on the couch, with its tail likely curled around its body or resting on the cushion.
Or even more detailed:
/kaggle/input/cat_on_couch.jpg: What is the color of the couch and the cat?
Response:
The couch appears to be a light color, possibly beige or cream. The cat is a darker color, with distinct markings that might include black, brown, and white patches.
This interactive mode is great for experimentation.
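For scripted use, the same question can be sent to Ollama's local HTTP API instead of the interactive prompt. The sketch below assumes a default install listening on localhost:11434 and uses the /api/generate endpoint, which accepts base64-encoded images in an images field. The helper names build_payload and ask_llava are made up for this example.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming generate request."""
    return {
        "model": model,
        "prompt": prompt,
        # Images are sent as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete JSON reply instead of a stream of chunks.
        "stream": False,
    }


def ask_llava(image_path: str, prompt: str) -> str:
    """Send one image and one question to a running Ollama server."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The generated text is returned in the "response" field.
        return json.loads(response.read())["response"]
```

With the server running and the image in place, calling ask_llava("cat_on_couch.jpg", "What is the main subject of this image?") should return an answer similar to the ones shown above.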
The core problem LLaVA solves is enabling large language models (LLMs) to "see" and interpret visual information. Traditional LLMs are text-only. LLaVA bridges this gap by combining a powerful vision encoder with an LLM.
Here’s a simplified look at how it works internally:
- Image Tokenization: The input image is divided into a grid of smaller, fixed-size patches. Think of it like cutting a picture into many small squares.
- Patch Embedding: Each image patch is converted into a numerical vector (an embedding) by a vision encoder, typically a pre-trained Vision Transformer (ViT). These embeddings represent the visual features of each patch.
- Projection Layer: A crucial component is a projection layer (often a simple linear layer or a small multi-layer perceptron) that maps the vision encoder’s patch embeddings into the same embedding space as the LLM’s text embeddings. This allows the LLM to "understand" the visual information as if it were text.
- Interleaving Embeddings: The projected image embeddings are then placed in the same input sequence as the text embeddings. If you ask a question about an image, the text of your question is embedded as usual by the LLM's token embedding layer. The combined sequence of visual and textual embeddings is what gets fed into the LLM.
- LLM Processing: The LLM then processes this combined sequence of embeddings, attending to both the visual and textual information to generate a relevant text response. The LLM learns to correlate visual patterns with textual descriptions and concepts.
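The steps above can be traced as a sequence of array shapes. This is only a toy sketch in NumPy: the sizes are tiny and invented, and the vision encoder, projection layer, and text embeddings are random stand-ins rather than real model weights. It is meant to show how the data flows, not how LLaVA is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, far smaller than any real model).
IMAGE_SIZE, PATCH_SIZE = 32, 8
VISION_DIM, LLM_DIM = 16, 24


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


# 1. Image tokenization: a 32x32 image becomes a 4x4 grid of 8x8 patches.
image = rng.random((IMAGE_SIZE, IMAGE_SIZE, 3))
patches = patchify(image, PATCH_SIZE)                 # (16, 192)

# 2. Patch embedding: a random linear map stands in for the vision encoder.
w_vision = rng.normal(size=(patches.shape[1], VISION_DIM))
patch_embeddings = patches @ w_vision                 # (16, 16)

# 3. Projection: map vision features to the LLM's embedding width.
w_proj = rng.normal(size=(VISION_DIM, LLM_DIM))
image_tokens = patch_embeddings @ w_proj              # (16, 24)

# 4. Combine with text: random vectors stand in for embedded text tokens.
prefix_tokens = rng.normal(size=(3, LLM_DIM))
question_tokens = rng.normal(size=(7, LLM_DIM))
llm_input = np.concatenate([prefix_tokens, image_tokens, question_tokens])

# 5. The LLM would now attend over all 26 positions at once.
print(patches.shape, image_tokens.shape, llm_input.shape)
```

After projection, the image tokens have the same width as the text tokens, which is what lets them sit in one sequence.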
The "vision encoder" part of LLaVA is usually a pre-trained model like CLIP’s ViT. Because CLIP was trained to match images with their captions, its features already carry a lot of semantic information about an image. The magic of LLaVA happens in how it connects this visual understanding to the LLM’s language capabilities. The projection layer is key because it translates the visual features into a format the LLM can work with, aligning their respective embedding spaces. Without this alignment, the LLM would receive vectors it was never trained to interpret, which would be no more useful to it than random numbers.
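As a rough illustration of the MLP variant of the projection layer, here is a two-layer projector with a GELU activation written in NumPy. The class name MLPProjector, the dimensions, and the random weights are all assumptions made for this example. In a trained model these weights are learned, and the input and output widths match the actual vision encoder and LLM.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPProjector:
    """Hypothetical two-layer MLP mapping vision features to LLM width."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random weights for illustration only; real weights are trained.
        self.w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patch_embeddings: np.ndarray) -> np.ndarray:
        hidden = gelu(patch_embeddings @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


# Toy sizes: 576 patch embeddings of width 64, projected to width 128.
projector = MLPProjector(vision_dim=64, llm_dim=128)
patch_embeddings = np.random.default_rng(1).normal(size=(576, 64))
image_tokens = projector(patch_embeddings)
print(image_tokens.shape)  # (576, 128)
```

The number of patches stays the same. Only the width of each vector changes, so that every image token matches the size of a text token embedding.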
When you use ollama run llava, Ollama handles loading the model weights, managing memory (including VRAM when a GPU is available), and orchestrating the interaction between the vision encoder and the LLM. It presents a simple command-line interface for you to input image paths and text prompts. The llava model itself pairs a vision encoder with a language model, and the two have been trained to work together using image-caption pairs and visual instruction-following data such as visual question answering.
One thing that surprises many users is how quickly quality can drop on "out-of-distribution" images or on prompts that require reasoning beyond simple object recognition. While LLaVA is impressive, its ability to infer complex relationships, understand abstract concepts, or perform highly nuanced reasoning about images is still developing. It excels at tasks that are well represented in its training data, such as describing scenes, identifying objects, and answering factual questions about visual content. For tasks requiring deep common sense or inferential leaps, its performance may degrade. The quality of the image itself also plays a significant role: blurry or low-resolution images will naturally lead to less accurate interpretations.
The next step is often to explore different LLaVA model sizes and experiment with fine-tuning for domain-specific visual tasks.