Multimodal AI Explained: How Models Process Text, Images, Audio, and Video
Multimodal AI explained — how models like GPT-4o and Gemini process text, images, audio, and video together, with practical examples and real-world applications.
The moment GPT-4V (Vision) launched, I uploaded a photo of a whiteboard full of technical diagrams and asked it to summarize the architecture. It not only identified the system components — it noticed a logical inconsistency in the flow arrows I hadn't seen after staring at it for an hour.
That's the promise of multimodal AI: not just seeing images, but reasoning across them, with context that spans text and visual information at once. By 2025, the best models process text, images, audio, and video together, which puts tasks such as document extraction, chart analysis, and video summarization within reach of a single model.
Here's how it works and how to use it.
The Architecture: How Models See
From Pixels to Tokens
LLMs work with tokens. To process images, you need to convert pixels into something token-like:
Image Processing Pipeline:
1. Input image (e.g., 224×224 pixels, RGB)
↓
2. Split into patches (e.g., 16×16 pixel patches)
→ For 224×224 image: 196 patches
↓
3. Linear projection: each patch → embedding vector
(similar to word embedding but for image patches)
↓
4. Add positional embeddings (tells model where each patch is)
↓
5. These "image tokens" are concatenated with text tokens
↓
6. Transformer processes all tokens together (joint attention)
→ Image tokens can attend to text tokens and vice versa
This Vision Transformer (ViT) approach, popularized by Google in 2020, is the foundation of most modern vision-language models.
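Steps 2 through 4 of the pipeline can be sketched in plain Python. This is a toy illustration, not how any production model is implemented: the helper names `patchify` and `embed_patches`, the random weights, and the embedding size of 8 are all assumptions chosen to keep the example small (real models use learned weights and embedding sizes in the hundreds or thousands).

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches


def embed_patches(patches, weights, positions):
    """Project each flattened patch linearly, then add its positional embedding."""
    tokens = []
    for patch, position in zip(patches, positions):
        projected = [sum(p * w for p, w in zip(patch, w_row)) for w_row in weights]
        tokens.append([a + b for a, b in zip(projected, position)])
    return tokens


rng = random.Random(0)

# Hypothetical 224x224 RGB image filled with random values in [0, 1)
image = [[[rng.random() for _ in range(3)] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)

embed_dim = 8  # toy size, chosen only to keep the example fast
patch_dim = len(patches[0])  # 16 * 16 * 3 values per patch
weights = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(embed_dim)]
positions = [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in patches]

image_tokens = embed_patches(patches, weights, positions)
print(len(patches), patch_dim, len(image_tokens), len(image_tokens[0]))
# 196 768 196 8
```

The resulting 196 vectors play the role of the "image tokens" in step 5: they have the same shape as text token embeddings, so they can sit in the same sequence.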
Using GPT-4o Vision
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def guess_media_type(image_path: str) -> str:
    """Pick a media type from the file extension (PNG, JPEG, otherwise WebP)."""
    lowered = image_path.lower()
    if lowered.endswith(".png"):
        return "image/png"
    if lowered.endswith((".jpg", ".jpeg")):
        return "image/jpeg"
    return "image/webp"


def encode_image(image_path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{guess_media_type(image_path)};base64,{image_data}"


# Method 1: Image from URL
def analyze_image_url(image_url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                            "detail": "high",  # "low" = faster/cheaper, "high" = better OCR
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
        max_tokens=1000,
    )
    return response.choices[0].message.content


# Method 2: Local image as base64
def analyze_local_image(image_path: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    return response.choices[0].message.content


# Multi-image analysis
def compare_images(image_paths: list[str], comparison_question: str) -> str:
    # One image part per file, followed by the question
    content = [
        {"type": "image_url", "image_url": {"url": encode_image(path)}}
        for path in image_paths
    ]
    content.append({"type": "text", "text": comparison_question})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Example: Document data extraction
# The image_url input takes image files, so convert PDF pages to PNG or JPEG first.
invoice_text = analyze_local_image(
    "invoice.png",
    "Extract: invoice number, date, line items with prices, and total. Return as JSON.",
)
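The `detail` setting matters because it changes how many tokens an image costs. The sketch below estimates that count using the tiling rule OpenAI has published for GPT-4o image inputs: a flat 85 tokens for "low", and for "high" a base of 85 plus 170 per 512-pixel tile after downscaling. The function name `estimate_image_tokens` is hypothetical, and the constants should be treated as assumptions that may change; check the current pricing documentation before relying on them.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate token cost of one image under the assumed GPT-4o tiling rule."""
    if detail == "low":
        return 85  # flat cost regardless of image size

    # Step 1: shrink (never enlarge) to fit inside a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Step 2: shrink (never enlarge) so the shortest side is at most 768
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # Step 3: count 512-pixel tiles and add the base cost
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4000, 3000, "low"))  # 85
```

Under these assumptions a square 1024-pixel image costs nine times more in high detail than in low detail, so "low" is worth trying first when the task does not depend on small text.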
Using Claude Vision
import base64

import anthropic
import httpx

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def claude_vision(image_path: str, question: str, media_type: str = "image/jpeg") -> str:
    # media_type must match the actual file format (JPEG, PNG, GIF, or WebP)
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    return response.content[0].text


# Claude with image from URL
def claude_vision_url(image_url: str, question: str, media_type: str = "image/jpeg") -> str:
    # Download the image, then send it as base64 like a local file
    download = httpx.get(image_url, follow_redirects=True)
    download.raise_for_status()
    image_data = base64.b64encode(download.content).decode("utf-8")
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data},
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text
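Because the Claude request declares a `media_type` alongside the base64 data, a mislabeled file (a PNG sent as `image/jpeg`, for example) may be rejected. One way to avoid guessing is to read the file's leading bytes. The helper below is a minimal sketch covering the four formats Claude's vision input accepts; the name `sniff_image_media_type` is hypothetical and not part of the Anthropic SDK.

```python
def sniff_image_media_type(data: bytes) -> str:
    """Identify JPEG, PNG, GIF, or WebP from the file's leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized or unsupported image format")


# Usage with made-up headers; real code would pass the bytes read from disk
print(sniff_image_media_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # image/png
print(sniff_image_media_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))   # image/jpeg
```

The same check works on downloaded bytes, which is useful when a URL has no file extension to go by.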
Open-Source Multimodal with LLaVA
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import torch

# LLaVA-1.6 (free, runs locally)
# The v1.6 checkpoints use the LLaVA-NeXT classes in transformers.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)


def llava_analyze(image_path: str, question: str) -> str:
    image = Image.open(image_path).convert("RGB")

    # LLaVA conversation format
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    # Keyword arguments avoid depending on the processor's positional order
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            do_sample=True,
        )

    # Decode only the generated tokens (not the prompt)
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)


# Or use Ollama (simpler setup)
import ollama


def ollama_vision(image_path: str, question: str) -> str:
    # Run `ollama pull llava` first
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_path],
        }],
    )
    return response["message"]["content"]
Audio and Video with Gemini
Of the models covered here, Gemini 1.5 Pro is the one that accepts both audio and video files directly:
import time

import google.generativeai as genai

genai.configure(api_key="your-gemini-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")


# Audio analysis
def analyze_audio(audio_path: str, question: str) -> str:
    audio_file = genai.upload_file(path=audio_path, mime_type="audio/mp3")
    response = model.generate_content([question, audio_file])
    return response.text


# Video analysis (up to ~1 hour)
def analyze_video(video_path: str, question: str) -> str:
    video_file = genai.upload_file(path=video_path, mime_type="video/mp4")

    # Wait for processing
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError(f"Processing failed for {video_path}")

    response = model.generate_content([video_file, question])
    return response.text


# Example: Video timestamp extraction
timestamps = analyze_video(
    "lecture.mp4",
    "List all topics covered with their timestamps in format MM:SS - Topic Name",
)

# Example: Multi-modal (image + audio)
scene_description = model.generate_content([
    genai.upload_file("photo.jpg"),
    genai.upload_file("ambient_sound.mp3"),
    "Describe what's happening in this scene, incorporating both the visual and audio context.",
]).text
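The wait-for-processing step in `analyze_video` polls until the uploaded file leaves the PROCESSING state. For long videos it can help to bound that wait. The helper below is a generic sketch, not part of the Gemini SDK: `wait_until_ready` and its parameters are hypothetical names, and `fetch_state` stands for any callable that returns the current state name as a string.

```python
import time


def wait_until_ready(fetch_state, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it stops returning "PROCESSING".

    Returns the final state name, raises RuntimeError on "FAILED",
    and raises TimeoutError once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        if state != "PROCESSING":
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still processing after {timeout_s} seconds")
        sleep(interval_s)


# Simulated upload that becomes ready on the third poll
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_ready(lambda: next(states), sleep=lambda seconds: None))  # ACTIVE
```

With the Gemini client from this section, the callable could be `lambda: genai.get_file(video_file.name).state.name`. The `sleep` and `clock` parameters exist only so the loop can be tested without real delays.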
Model Comparison: Multimodal Capabilities
| Model | Images | Audio | Video | OCR | Reasoning | Cost (input/output, USD per 1M tokens) |
|---|---|---|---|---|---|---|
| GPT-4o | ✓ | ✓ | Limited | Excellent | Excellent | $5/$15/M |
| Claude 3.5 Sonnet | ✓ | ✗ | ✗ | Excellent | Excellent | $3/$15/M |
| Gemini 1.5 Pro | ✓ | ✓ | ✓ (1hr) | Good | Good | $3.5/$10.5/M |
| LLaVA-1.6 7B | ✓ | ✗ | ✗ | Good | Good | Free (local) |
| Phi-3-Vision | ✓ | ✗ | ✗ | Good | Good | Free (local) |
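To turn the cost column into a per-request figure, the arithmetic is tokens divided by one million, times the price, summed over input and output. The sketch below copies the three hosted-model prices from the table and assumes they mean input price / output price per million tokens; the figures may be out of date, and `estimate_cost` is a hypothetical helper name.

```python
# (input, output) USD per 1M tokens, copied from the comparison table.
# Assumption: these figures may no longer match current provider pricing.
PRICES_PER_MILLION = {
    "GPT-4o": (5.00, 15.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro": (3.50, 10.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: one image plus prompt (~1,000 input tokens), 500 output tokens
for name in PRICES_PER_MILLION:
    print(f"{name}: ${estimate_cost(name, 1_000, 500):.4f}")
```

Image inputs count toward `input_tokens`, so the per-image token cost of each provider feeds directly into this estimate.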
Practical Applications
Document Extraction Pipeline
import json


def extract_invoice_data(invoice_path: str) -> dict:
    """Extract structured data from invoice image."""
    result = analyze_local_image(
        invoice_path,
        """Extract the following from this invoice and return as JSON:
{
  "invoice_number": "...",
  "date": "YYYY-MM-DD",
  "vendor": "...",
  "line_items": [{"description": "...", "quantity": N, "unit_price": N, "total": N}],
  "subtotal": N,
  "tax": N,
  "total": N
}
Return ONLY valid JSON, no other text."""
    )
    # json.loads raises ValueError if the model adds any text around the JSON
    return json.loads(result)


# Chart analysis
def analyze_chart(chart_path: str) -> str:
    """Return the model's JSON-formatted description of a chart as a string."""
    description = analyze_local_image(
        chart_path,
        "Analyze this chart: describe the type, axes, key trends, "
        "data values (approximate), and main insights. Return as JSON."
    )
    return description
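Even with a "Return ONLY valid JSON" instruction, models sometimes wrap the object in a markdown code fence or add a sentence before it, and a bare `json.loads` call then fails. A tolerant parser can take the outermost `{...}` span before decoding. This is a minimal sketch; `parse_model_json` is a hypothetical helper, and it assumes the reply contains exactly one top-level JSON object.

```python
import json


def parse_model_json(raw: str) -> dict:
    """Decode the outermost {...} span of a model reply as a JSON object."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a ValueError subclass, so callers can catch one type
    return json.loads(raw[start:end + 1])


# Simulated reply with leading prose around the object
reply = 'Here is the data: {"invoice_number": "A-17", "total": 12.5}'
print(parse_model_json(reply))
# {'invoice_number': 'A-17', 'total': 12.5}
```

For production use, validating the parsed dictionary against the expected keys and types is still worthwhile, since well-formed JSON can carry wrong or missing fields.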
Conclusion
Multimodal AI is moving from novelty to infrastructure. Document processing that previously required specialized OCR pipelines now works via a single API call. Video analysis that required human review can be automated. Code can be generated from UI screenshots.
The practical guidance: for image+text tasks, any of GPT-4o, Claude 3.5 Sonnet, or Gemini delivers excellent results. For video, Gemini 1.5 Pro is the strongest option in the comparison above, with native support for files up to about an hour; GPT-4o also handles audio, but its video support is limited. For privacy-sensitive workloads, LLaVA via Ollama runs locally with no data leaving your machine.
For building complete AI applications with multimodal capabilities, see our AI chatbot guide. For the transformer architecture underlying these vision models, see our transformer guide.