Multimodal AI Explained: How Models Process Text, Images, Audio, and Video
Multimodal AI explained — how models like GPT-4o and Gemini process text, images, audio, and video together, with practical examples and real-world applications.
The moment GPT-4V (Vision) launched, I uploaded a photo of a whiteboard full of technical diagrams and asked it to summarize the architecture. It not only identified the system components, it also noticed a logical inconsistency in the flow arrows that I had missed after staring at the board for an hour.
That's the promise of multimodal AI: not just seeing images, but reasoning across them, with context that spans text and visual information at once. By 2025, the best models process text, images, audio, and video together, which opens up use cases that single-modality models could not handle.
Here's how it works and how to use it.
The Architecture: How Models See
From Pixels to Tokens
LLMs work with tokens. To process images, you need to convert pixels into something token-like:
Image Processing Pipeline:
1. Input image (e.g., 224×224 pixels, RGB)
↓
2. Split into patches (e.g., 16×16 pixel patches)
→ For a 224×224 image: (224 ÷ 16)² = 14 × 14 = 196 patches
↓
3. Linear projection: each patch → embedding vector
(similar to word embedding but for image patches)
↓
4. Add positional embeddings (tells model where each patch is)
↓
5. These "image tokens" are concatenated with text tokens
↓
6. Transformer processes all tokens together (joint attention)
→ Image tokens can attend to text tokens and vice versa
This Vision Transformer (ViT) approach, popularized by Google in 2020, is the foundation of most modern vision-language models.
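The pipeline above can be sketched in plain Python. This is a toy illustration, not how a production ViT is implemented: the weights and positional vectors are random stand-ins for learned parameters, the embedding size of 8 is chosen only to keep the example fast, and the names `patchify` and `embed_patches` are hypothetical.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def embed_patches(patches, weights, positions):
    """Linearly project each flattened patch, then add its positional vector."""
    tokens = []
    for patch, position in zip(patches, positions):
        projected = [sum(p * w for p, w in zip(patch, column)) for column in weights]
        tokens.append([value + pos for value, pos in zip(projected, position)])
    return tokens


rng = random.Random(0)
EMBED_DIM = 8  # toy size; real models use far larger embeddings

# Random stand-in for a 224x224 RGB image with values in [0, 1).
image = [[[rng.random() for _ in range(3)] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=16)

patch_len = 16 * 16 * 3  # pixels per patch times colour channels
# Random stand-ins for learned parameters: one weight column per output dimension.
weights = [[rng.gauss(0, 0.02) for _ in range(patch_len)] for _ in range(EMBED_DIM)]
positions = [[rng.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in patches]

image_tokens = embed_patches(patches, weights, positions)
print(len(patches), len(patches[0]), len(image_tokens), len(image_tokens[0]))
```

Each of the 196 patches becomes one vector, and that sequence of vectors is what gets concatenated with the text token embeddings in step 5.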
Using GPT-4 Vision
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Method 1: Image from URL
def analyze_image_url(image_url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                            "detail": "high"  # "low" = faster/cheaper, "high" = better OCR
                        }
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Method 2: Local image as base64
def analyze_local_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Get image media type from the file extension (case-insensitive)
    lower_path = image_path.lower()
    if lower_path.endswith(".png"):
        media_type = "image/png"
    elif lower_path.endswith((".jpg", ".jpeg")):
        media_type = "image/jpeg"
    elif lower_path.endswith(".webp"):
        media_type = "image/webp"
    else:
        # Fail early instead of mislabelling the file (e.g. a PDF) as WebP
        raise ValueError(f"Unsupported image format: {image_path}")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{image_data}"
                        }
                    },
                    {"type": "text", "text": question}
                ]
            }
        ]
    )
    return response.choices[0].message.content

# Multi-image analysis
def compare_images(image_paths: list[str], comparison_question: str) -> str:
    content = []
    for path in image_paths:
        with open(path, "rb") as f:
            data = base64.b64encode(f.read()).decode("utf-8")
        content.append({
            "type": "image_url",
            # Assumes every input is a JPEG; change the media type for other formats
            "image_url": {"url": f"data:image/jpeg;base64,{data}"}
        })
    # The question goes last, after all images
    content.append({"type": "text", "text": comparison_question})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content

# Example: Document data extraction
# The function expects an image file, so render a PDF page to PNG or JPEG first.
invoice_text = analyze_local_image(
    "invoice.png",
    "Extract: invoice number, date, line items with prices, and total. Return as JSON."
)
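The extension checks above can also be delegated to the standard library. The following is a minimal sketch, assuming the endpoint accepts PNG, JPEG, GIF, and WebP (check the provider's documentation for the current list); `guess_image_media_type` and `build_data_url` are hypothetical helper names, not part of the OpenAI SDK.

```python
import base64
import mimetypes

# Formats assumed to be accepted by the vision endpoint; adjust per provider docs.
SUPPORTED_IMAGE_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def guess_image_media_type(image_path: str) -> str:
    """Map a file name to a supported image media type, or raise ValueError."""
    media_type, _ = mimetypes.guess_type(image_path.lower())
    if media_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"Unsupported image type for {image_path!r}: {media_type}")
    return media_type


def build_data_url(image_bytes: bytes, media_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"


print(guess_image_media_type("scan.PNG"))
print(build_data_url(b"abc", "image/png"))
```

Rejecting unknown extensions up front means a file such as `invoice.pdf` fails with a clear error locally rather than producing a confusing API response.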
Using Claude Vision
import anthropic
import base64
import httpx

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def claude_vision(image_path: str, question: str, media_type: str = "image/jpeg") -> str:
    # media_type must match the actual file format (e.g. "image/png" for PNG files)
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ],
            }
        ],
    )
    return response.content[0].text

# Claude with image from URL
def claude_vision_url(image_url: str, question: str, media_type: str = "image/jpeg") -> str:
    # Download the image, fail on HTTP errors, then inline it as base64
    download = httpx.get(image_url)
    download.raise_for_status()
    image_data = base64.b64encode(download.content).decode("utf-8")

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data}
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text
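When an image arrives from a URL, there may be no reliable file extension to infer the media type from. One option is to inspect the first bytes of the download. The sketch below checks the well-known file signatures for PNG, JPEG, GIF, and WebP and builds an image block in the same shape as the examples above; `sniff_image_media_type` and `make_image_block` are hypothetical helpers, not part of the Anthropic SDK.

```python
import base64


def sniff_image_media_type(data: bytes) -> str:
    """Identify common image formats from their leading signature bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")


def make_image_block(data: bytes) -> dict:
    """Build a base64 image content block with a sniffed media type."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": sniff_image_media_type(data),
            "data": base64.b64encode(data).decode("utf-8"),
        },
    }


block = make_image_block(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
print(block["source"]["media_type"])
```

The returned dictionary can be placed in the `content` list in place of the hardcoded `"image/jpeg"` block.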
Open-Source Multimodal with LLaVA
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import torch

# LLaVA-1.6 (free, runs locally)
# LLaVA-1.6 checkpoints use the LLaVA-NeXT classes in transformers,
# not the original LlavaForConditionalGeneration.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def llava_analyze(image_path: str, question: str) -> str:
    image = Image.open(image_path).convert("RGB")

    # LLaVA conversation format
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    # Keyword arguments avoid depending on the processor's positional order
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            do_sample=True
        )

    # Decode only the generated tokens (not the prompt)
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)

# Or use Ollama (simpler setup)
import ollama

def ollama_vision(image_path: str, question: str) -> str:
    # ollama pull llava first
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_path]
        }]
    )
    return response["message"]["content"]
Audio and Video with Gemini
Gemini 1.5 Pro accepts audio and video files directly as input:
import time

import google.generativeai as genai

genai.configure(api_key="your-gemini-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Audio analysis
def analyze_audio(audio_path: str, question: str) -> str:
    audio_file = genai.upload_file(path=audio_path, mime_type="audio/mp3")
    response = model.generate_content([
        question,
        audio_file
    ])
    return response.text

# Video analysis (up to ~1 hour)
def analyze_video(video_path: str, question: str) -> str:
    video_file = genai.upload_file(path=video_path, mime_type="video/mp4")

    # Wait for processing
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)

    # Stop here if the upload could not be processed
    if video_file.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {video_file.name}")

    response = model.generate_content([
        video_file,
        question
    ])
    return response.text

# Example: Video timestamp extraction
timestamps = analyze_video(
    "lecture.mp4",
    "List all topics covered with their timestamps in format MM:SS - Topic Name"
)

# Example: Multi-modal (image + audio)
scene_description = model.generate_content([
    genai.upload_file("photo.jpg"),
    genai.upload_file("ambient_sound.mp3"),
    "Describe what's happening in this scene, incorporating both the visual and audio context."
]).text  # .text extracts the generated string from the response object
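The polling loop in `analyze_video` waits indefinitely if processing never finishes. A bounded version might look like the sketch below. It assumes the file object exposes `state.name` and `name` as in the code above and that a successfully processed file reports the state `"ACTIVE"`. The helper `wait_until_processed` and its parameters are hypothetical; the `refresh` callable would be `genai.get_file` in real use.

```python
import time


def wait_until_processed(file, refresh, timeout_s=300.0, poll_s=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll an uploaded file until it leaves PROCESSING, with a time limit.

    `refresh` takes the file's name and returns a fresh file object.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while file.state.name == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(f"{file.name} still processing after {timeout_s}s")
        sleep(poll_s)
        file = refresh(file.name)
    # Any terminal state other than ACTIVE is treated as an error (assumption).
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"{file.name} ended in state {file.state.name}")
    return file
```

In `analyze_video`, the `while` loop could then be replaced by `video_file = wait_until_processed(video_file, genai.get_file)`.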
Model Comparison: Multimodal Capabilities
| Model | Images | Audio | Video | OCR | Reasoning | Cost (input/output, USD per 1M tokens) |
|---|---|---|---|---|---|---|
| GPT-4o | ✓ | ✓ | Limited | Excellent | Excellent | $5/$15/M |
| Claude 3.5 Sonnet | ✓ | ✗ | ✗ | Excellent | Excellent | $3/$15/M |
| Gemini 1.5 Pro | ✓ | ✓ | ✓ (1hr) | Good | Good | $3.5/$10.5/M |
| LLaVA-1.6 7B | ✓ | ✗ | ✗ | Good | Good | Free (local) |
| Phi-3-Vision | ✓ | ✗ | ✗ | Good | Good | Free (local) |
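To compare the hosted models on a concrete workload, the prices in the table can be turned into a rough estimate. This sketch assumes the Cost column lists input and output prices per million tokens, and the token counts in the example are made up for illustration. Prices change often, so treat the figures as placeholders and check each provider's current pricing page.

```python
# USD per 1M tokens as (input, output), copied from the comparison table above.
PRICES_PER_MILLION = {
    "GPT-4o": (5.00, 15.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (3.50, 10.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a given token volume."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1,000 requests, each ~1,000 input and ~300 output tokens.
for name in PRICES_PER_MILLION:
    cost = estimate_cost(name, input_tokens=1_000_000, output_tokens=300_000)
    print(f"{name}: ${cost:.2f}")
```

Image inputs count toward input tokens, and how many tokens an image consumes differs by provider and resolution, so real costs depend heavily on image size and detail settings.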
Practical Applications
Document Extraction Pipeline
import json

def extract_invoice_data(invoice_path: str) -> dict:
    """Extract structured data from invoice image."""
    result = analyze_local_image(
        invoice_path,
        """Extract the following from this invoice and return as JSON:
{
  "invoice_number": "...",
  "date": "YYYY-MM-DD",
  "vendor": "...",
  "line_items": [{"description": "...", "quantity": N, "unit_price": N, "total": N}],
  "subtotal": N,
  "tax": N,
  "total": N
}
Return ONLY valid JSON, no other text."""
    )
    # Raises json.JSONDecodeError if the model returns anything besides JSON
    return json.loads(result)

# Chart analysis
def analyze_chart(chart_path: str) -> str:
    """Return the model's JSON description of a chart as a raw string."""
    description = analyze_local_image(
        chart_path,
        "Analyze this chart: describe the type, axes, key trends, "
        "data values (approximate), and main insights. Return as JSON."
    )
    # Returned unparsed; call json.loads on it if a dict is needed
    return description
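Models often wrap JSON in a Markdown code fence or omit a field even when told not to, which makes a bare `json.loads` call fragile. The sketch below shows one defensive way to parse the reply. `strip_code_fence` and `parse_model_json` are hypothetical helpers, and the required keys in the example are taken from the invoice prompt above.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence (with optional language tag)."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening fence line and the closing fence line.
        text = "\n".join(lines[1:-1])
    return text


def parse_model_json(raw: str, required_keys=()) -> dict:
    """Parse a model reply as a JSON object and check that required keys exist."""
    data = json.loads(strip_code_fence(raw))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


reply = FENCE + 'json\n{"invoice_number": "A-17", "total": 42}\n' + FENCE
print(parse_model_json(reply, required_keys=["invoice_number", "total"]))
```

In `extract_invoice_data`, the final `json.loads(result)` could be swapped for `parse_model_json(result, required_keys=[...])` so that malformed replies fail with a descriptive error.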
Conclusion
Multimodal AI is moving from novelty to infrastructure. Document processing that previously required specialized OCR pipelines now works via a single API call. Video analysis that required human review can be automated. Code can be generated from UI screenshots.
The practical guidance: for image+text tasks, any of GPT-4o, Claude 3.5 Sonnet, or Gemini delivers excellent results. For video, Gemini 1.5 Pro is the only model in the comparison above with full native support; for audio, both Gemini 1.5 Pro and GPT-4o qualify. For privacy-sensitive workloads, LLaVA via Ollama runs locally with no data leaving your machine.
For building complete AI applications with multimodal capabilities, see our AI chatbot guide. For the transformer architecture underlying these vision models, see our transformer guide.
Frequently Asked Questions
What is multimodal AI?
Models that process multiple data types — text, images, audio, video — in a unified framework. Instead of separate specialized models, multimodal AI reasons across modalities simultaneously: analyzing a chart and answering questions, describing medical images, or generating code from screenshots.
How do vision-language models process images?
Images are split into patches (16×16 or 32×32 pixels), each patch is projected into an embedding vector, and these "image tokens" are processed alongside text tokens in the transformer. Attention mechanisms allow text tokens to attend to image tokens and vice versa, enabling cross-modal reasoning.
What can I use multimodal AI for in practice?
Document AI (extract data from invoices, forms, PDFs), code from screenshots, chart analysis, medical imaging analysis, video summarization, automated alt-text generation, product quality inspection, and accessibility tools. Any task requiring reasoning over visual + textual information together.
What is the difference between CLIP, LLaVA, and GPT-4 Vision?
CLIP: produces embeddings for image-text matching, not generative. LLaVA: open-source, connects CLIP encoder to LLaMA, runs locally, instruction-tuned. GPT-4o/Vision: proprietary, largest scale, best complex reasoning. Gemini: natively multimodal including video and audio.
How do I use multimodal models in Python?
OpenAI: pass images as base64 or URL in the messages content array. Claude: use image blocks with base64 source. Gemini: upload files via genai.upload_file(). Open-source: use transformers with AutoProcessor + LlavaForConditionalGeneration, or Ollama with the llava model for local inference.
AiTechWorlds Team