Lesson 05 · Become an AI Engineer — Practical Guide · 10 min read

Build a Multi-modal Generation Agent

April 17, 2026

TL;DR

This lesson covers how image and video generation works — from VAEs and GANs to modern diffusion models. You'll learn the full text-to-image pipeline (data prep, U-Net/DiT, training, sampling, evaluation), text-to-video challenges, and then build a multi-modal generation agent that routes requests to the right provider and composes outputs.


So far you’ve built systems that work with text — chatbots, search agents, research tools. But the real world is multi-modal. Users want images, videos, and audio — not just text. This lesson teaches you how generative models produce visual and audio content, and then you’ll build an agent that orchestrates all of them.

Overview of Image and Video Generation

Four major architectures have powered generative AI for images and video:

Generative Model Architectures

VAE (Variational Autoencoder)

VAEs learn to compress images into a latent space and reconstruct them. They’re trained to make the latent space smooth and continuous, which allows interpolation between images.

# Conceptual VAE architecture
# Encoder: image → latent distribution (mean, variance)
# Decoder: sample from latent → reconstructed image
# Loss = reconstruction_loss + KL_divergence(latent, standard_normal)

VAEs produce blurry images on their own, but they’re a critical component of modern diffusion models — the “latent” in Latent Diffusion Models is a VAE latent space.
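
Here's a minimal sketch of the idea in PyTorch (assuming torch is installed; the 64×64 image size and layer widths are illustrative):

# Minimal VAE sketch (illustrative, not production code)
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 512), nn.ReLU())
        self.to_mean = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 64 * 64 * 3), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent while keeping gradients
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return self.decoder(z), mean, logvar

def vae_loss(x, recon, mean, logvar):
    recon_loss = F.mse_loss(recon, x.flatten(1))
    # KL divergence between N(mean, var) and the standard normal keeps the latent space smooth
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    return recon_loss + kl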

GANs (Generative Adversarial Networks)

Two networks in competition: a Generator creates fake images, a Discriminator tries to detect fakes. They train each other — the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes.

GANs produced the first photorealistic AI images (StyleGAN faces) but suffered from training instability and mode collapse (generating only a few types of images).
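
A sketch of one adversarial training step, assuming PyTorch and already-defined generator and discriminator networks (names, shapes, and the latent size are illustrative):

# One adversarial training step (sketch; networks and optimizers are assumed to exist)
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=128):
    batch, device = real_images.size(0), real_images.device
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # Discriminator step: push real images toward 1, generated images toward 0
    fakes = generator(torch.randn(batch, latent_dim, device=device)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_images), ones)
              + F.binary_cross_entropy_with_logits(discriminator(fakes), zeros))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label fresh fakes as real
    fakes = generator(torch.randn(batch, latent_dim, device=device))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fakes), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()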

Auto-Regressive Models

Treat images as sequences of tokens — just like text. Tokenize the image into discrete codes (using a VQ-VAE), then predict the next token one at a time.

This is how GPT-4o and Gemini generate images natively — they use the same transformer architecture for both text and images. The advantage: unified multimodal models that can interleave text and images naturally.
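
Conceptually, the generation loop looks like this (every component here is a hypothetical stand-in; a real system uses a trained VQ decoder and a decoder-only transformer):

# Images as token sequences (conceptual; `transformer` and `vq_decoder` are hypothetical)
def generate_image_autoregressively(transformer, vq_decoder, prompt_tokens):
    # A VQ-VAE maps a 256x256 image to a 32x32 grid of discrete codes (1,024 tokens).
    # The transformer predicts those codes one at a time, exactly like text tokens.
    image_tokens = transformer.generate(prompt_tokens, max_new_tokens=1024)
    # The VQ decoder turns the predicted code grid back into pixels.
    return vq_decoder(image_tokens.reshape(32, 32))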

Diffusion Models

The dominant approach today. The core idea is beautifully simple:

  1. Forward process: gradually add noise to an image until it’s pure random noise
  2. Train a model to predict and remove the noise at each step
  3. Generate: start from random noise and iteratively denoise to create an image

Diffusion models produce the highest quality images and are behind DALL·E 3, Stable Diffusion, Midjourney, and Flux.


Text-to-Image (T2I)

Let’s go deep on how text-to-image generation works in practice.

Text-to-Image Diffusion Pipeline

Data Preparation

Training a T2I model requires millions of image-caption pairs:

  1. Collection — Scrape the web for images with alt text, or use datasets like LAION-5B
  2. Quality filtering — Remove low-resolution, NSFW, duplicates, and aesthetically poor images
  3. Re-captioning — Original alt text is often terrible. Modern pipelines use LLMs (like LLaVA or CogVLM) to generate detailed, accurate captions
  4. Standardization — Resize to standard aspect ratio buckets, normalize pixel values
# Conceptual data preparation pipeline
import pandas as pd


def prepare_dataset(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Filter and prepare image-caption pairs for training."""
    # Quality filtering
    filtered = raw_data[
        (raw_data["width"] >= 512) &
        (raw_data["height"] >= 512) &
        (raw_data["aesthetic_score"] >= 5.0) &
        (raw_data["nsfw_score"] < 0.1) &
        (raw_data["watermark_prob"] < 0.3)
    ]

    # Re-captioning with an LLM (conceptual)
    # For each image, generate a detailed caption:
    # "A golden retriever sitting on a red velvet couch in a warmly lit
    #  living room, soft bokeh background, professional photography"
    # Instead of: "dog on couch"

    return filtered
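
The re-captioning step calls a vision-language model per image. Here's a minimal sketch using the OpenAI API (the model choice and caption prompt are illustrative; any captioner such as LLaVA or CogVLM fills the same role):

# Re-captioning with a vision-language model (illustrative)
from openai import OpenAI

client = OpenAI()

def recaption_image(image_url: str) -> str:
    """Generate a detailed training caption for one image."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one detailed sentence: "
                                         "subject, setting, lighting, style, composition."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=120,
    )
    return response.choices[0].message.content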

Diffusion Architectures

Two main architectures for the denoiser:

U-Net (used in SD 1.5, SD 2.1, DALL·E 2):

  • Encoder-decoder with skip connections
  • Cross-attention layers inject text embeddings
  • Processes at spatial resolution (e.g., 64×64 latent)

DiT — Diffusion Transformer (used in SD 3, Flux, DALL·E 3, Sora):

  • Replaces U-Net with a standard transformer
  • Patches image into tokens (like ViT)
  • Scales better with compute — transformers are well-understood
  • Currently the state-of-the-art
# The shift from U-Net to DiT
architecture_comparison = {
    "U-Net": {
        "used_by": ["SD 1.5", "SD 2.1", "DALL·E 2"],
        "pros": ["Proven", "Efficient at lower res"],
        "cons": ["Hard to scale", "Custom architecture"],
    },
    "DiT": {
        "used_by": ["SD 3", "Flux", "DALL·E 3", "Sora"],
        "pros": ["Scales with compute", "Standard transformer", "Better quality"],
        "cons": ["More compute at inference", "Newer, less tooling"],
    }
}

Diffusion Training

Forward process: Add Gaussian noise to the image over T timesteps:

x_t = √(ᾱ_t) · x_0 + √(1 - ᾱ_t) · ε     where ε ~ N(0, I)

Training objective: The model ε_θ learns to predict the noise ε that was added:

L = E_{t, x_0, ε}[ ‖ε - ε_θ(x_t, t, c)‖² ]

Where c is the text conditioning (CLIP or T5 text embedding).
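
One training step in PyTorch looks roughly like this (a sketch: `denoiser` stands in for the U-Net or DiT, `alphas_cumprod` comes from the noise schedule, and `text_emb` is the precomputed text conditioning):

# One text-to-image diffusion training step (sketch)
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, x0, text_emb, alphas_cumprod, num_timesteps=1000):
    batch = x0.size(0)
    t = torch.randint(0, num_timesteps, (batch,), device=x0.device)  # random timestep per sample
    noise = torch.randn_like(x0)                                      # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise              # forward (noising) process
    pred = denoiser(x_t, t, text_emb)                                  # epsilon_theta(x_t, t, c)
    return F.mse_loss(pred, noise)                                     # ||eps - eps_theta||^2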

Classifier-Free Guidance (CFG): At inference, amplify the text signal:

ε_guided = ε_uncond + w · (ε_cond - ε_uncond)

Higher w (guidance scale) = stronger adherence to the prompt, but less diversity.
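
In code, guidance is just two denoiser calls per sampling step (sketch; `denoiser` and the embeddings are assumed to come from the trained model):

# Classifier-free guidance at a single sampling step (sketch)
def guided_noise_prediction(denoiser, x_t, t, text_emb, empty_emb, guidance_scale=7.5):
    eps_uncond = denoiser(x_t, t, empty_emb)   # prediction without text conditioning
    eps_cond = denoiser(x_t, t, text_emb)      # prediction with the prompt
    # Amplify the direction the text pushes the prediction in
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)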

Diffusion Sampling

Starting from noise, iteratively remove noise to generate an image:

Sampler | Steps | Speed | Quality | Notes
DDPM | 1000 | Very slow | Excellent | Original; the theoretical foundation
DDIM | 20-50 | Moderate | Good | Deterministic, allows interpolation
DPM++ 2M | 20-30 | Fast | Very good | Popular default in Stable Diffusion
Euler | 20-30 | Fast | Good | Simple, reliable
LCM | 4-8 | Very fast | Good | Latent Consistency Models, distilled
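
You rarely implement samplers yourself. With the Hugging Face diffusers library, swapping the sampler is one line (a sketch assuming diffusers is installed, a GPU is available, and using one example checkpoint):

# Generate with an off-the-shelf pipeline and swap the sampler
# (requires: pip install diffusers transformers accelerate)
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with DPM++ 2M
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a golden retriever on a red velvet couch, warm lighting, professional photography",
    num_inference_steps=25,   # DPM++ 2M gives good results at 20-30 steps
    guidance_scale=7.5,
).images[0]
image.save("sample.png")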

Evaluation Metrics

How to measure if your T2I model is good:

Metric | What It Measures | Range | Target
FID (Fréchet Inception Distance) | Distribution similarity to real images | 0–∞ | Lower is better (< 10 is excellent)
IS (Inception Score) | Image quality + diversity | 1–∞ | Higher is better
CLIP Score | Image-text alignment | 0–1 | Higher = better prompt following
Aesthetic Score | Human aesthetic preference | 1–10 | > 6 is good
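
CLIP Score is easy to approximate yourself with the transformers library (a rough sketch; real evaluations average over thousands of prompt-image pairs):

# Rough CLIP score: cosine similarity between image and text embeddings
# (requires: pip install transformers torch pillow)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())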

Text-to-Video (T2V)

Video generation extends image generation into the temporal dimension — and that makes everything harder.

Latent Diffusion Modeling for Video

The key innovation: work in a compressed latent space rather than pixel space.

Video (H×W×T×3) → VAE Encoder → Latent (h×w×t×c) → Diffusion → VAE Decoder → Video

A 10-second, 720p, 30 fps video is roughly 275 million pixels (over 800 million raw RGB values). Working in latent space compresses this by 64× or more.
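
A quick back-of-the-envelope check, assuming the common 8× per spatial dimension, 4× temporal, and a 4-channel latent (exact factors vary by model):

# Rough latent-space savings for a 10-second, 720p, 30 fps clip
frames, height, width = 300, 720, 1280
raw_values = frames * height * width * 3                         # ~830M raw RGB values
latent_values = (frames // 4) * (height // 8) * (width // 8) * 4  # 8x spatial, 4x temporal, 4 channels
print(raw_values, latent_values, raw_values / latent_values)      # ~4.3M latent values, ~190x smaller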

Compression Networks

Video requires a 3D VAE that compresses both spatially (H,W) and temporally (T):

  • Image VAE: spatial compression only (8× per dimension is typical for Stable Diffusion-style VAEs)
  • Video VAE (basic): adds temporal compression on top of spatial, roughly 32× overall
  • Video VAE (aggressive): pushes to roughly 64× or more, trading some reconstruction quality for compute

Data Preparation for Video

Much harder than images:

# Video data preparation pipeline (conceptual)
video_prep_steps = [
    "1. Scene detection — split long videos into individual scenes/shots",
    "2. Motion filtering — remove static or too-shaky clips",
    "3. Quality filtering — resolution, compression artifacts, watermarks",
    "4. Caption generation — use video LLMs (Gemini, GPT-4o) for temporal descriptions",
    "5. Standardization — fixed FPS, resolution buckets, frame count",
    "6. Video latent caching — pre-encode all frames through VAE (saves training compute)",
]

DiT Architecture for Videos

The same DiT transformer works for video — just with additional temporal tokens (a quick token-count example follows the list):

  • Image: patch tokens arranged in a 2D grid
  • Video: patch tokens arranged in a 3D grid (H × W × T)
  • Temporal attention layers capture motion between frames
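
The token count is what makes this expensive (patch sizes here are illustrative):

# How many tokens a DiT sees for an image vs. a video latent
def num_dit_tokens(latent_frames, latent_h, latent_w, patch_t=1, patch_s=2):
    return (latent_frames // patch_t) * (latent_h // patch_s) * (latent_w // patch_s)

print(num_dit_tokens(1, 64, 64))    # single image latent: 1,024 tokens
print(num_dit_tokens(32, 64, 64))   # 32 latent frames: 32,768 tokens, and attention cost is quadratic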

Large-Scale Training Challenges

Challenge | Why It's Hard | Mitigation
Compute | Training Sora-class models costs $10-100M+ | Progressive training (low-res → high-res)
Data | Video data is 100× larger than image data | Pre-compute VAE latents, efficient data loading
Temporal coherence | Frames must be consistent over time | Joint spatial-temporal attention
Motion quality | Physics, object permanence | More training data, physics priors
Length | Longer videos mean quadratically more attention compute | Hierarchical generation (keyframes → interpolation)

The Overall T2V System

A production T2V system typically does the following (the flow is sketched in code after the list):

  1. Takes a text prompt + optional image/video input
  2. Enhances the prompt with an LLM for more detail
  3. Generates a short clip (2-10 seconds) with the diffusion model
  4. Optionally extends via video-to-video continuation
  5. Applies super-resolution for final output
  6. Runs safety filters
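
Stitched together, the serving flow looks roughly like this (every component below is a hypothetical placeholder for a real model or service):

# End-to-end T2V serving flow (all components here are hypothetical placeholders)
def serve_t2v_request(prompt, llm, t2v_model, upscaler, safety_filter, image=None):
    detailed = llm.enhance(prompt)                                  # 2. prompt enhancement
    clip = t2v_model.generate(detailed, image=image, seconds=5)     # 3. short base clip
    clip = t2v_model.extend(clip, detailed, seconds=5)              # 4. optional continuation
    clip = upscaler.run(clip, target_resolution=1080)               # 5. super-resolution
    if not safety_filter.passes(clip, prompt):                      # 6. block unsafe outputs
        raise ValueError("Generation blocked by safety filters")
    return clip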

Project: Build the Multi-modal Generation Agent

Now let’s build an agent that can generate images, video, and audio by orchestrating multiple APIs.

Multi-modal Agent Architecture

Provider Clients

# providers.py
import os
import time

import httpx
from openai import OpenAI

client = OpenAI()


def generate_image_dalle(prompt: str, size: str = "1024x1024",
                         quality: str = "hd") -> dict:
    """Generate image with DALL·E 3."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )
    return {
        "url": response.data[0].url,
        "revised_prompt": response.data[0].revised_prompt,
        "provider": "dall-e-3",
    }


def generate_image_flux(prompt: str) -> dict:
    """Generate image with Flux via Replicate."""
    headers = {"Authorization": f"Bearer {os.getenv('REPLICATE_API_TOKEN')}"}
    response = httpx.post(
        "https://api.replicate.com/v1/models/black-forest-labs/flux-1.1-pro/predictions",
        headers=headers,
        json={"input": {"prompt": prompt, "aspect_ratio": "16:9"}},
        timeout=60,
    )
    response.raise_for_status()
    result = response.json()

    # Poll the prediction until it finishes
    while result.get("status") not in ("succeeded", "failed", "canceled"):
        time.sleep(2)
        result = httpx.get(result["urls"]["get"], headers=headers).json()

    # Flux returns a single URL; some Replicate models return a list
    output = result.get("output")
    return {
        "url": output[0] if isinstance(output, list) else output,
        "provider": "flux-1.1-pro",
    }


def generate_speech(text: str, voice: str = "nova") -> dict:
    """Generate speech with OpenAI TTS."""
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )
    output_path = "/tmp/speech_output.mp3"
    response.stream_to_file(output_path)
    return {
        "file_path": output_path,
        "provider": "openai-tts",
        "voice": voice,
    }


def generate_video_runway(prompt: str, duration: int = 5) -> dict:
    """Generate video with Runway Gen-3 Alpha (conceptual — API may differ)."""
    response = httpx.post(
        "https://api.rev.ai/runway/v1/generate",
        headers={"Authorization": f"Bearer {os.getenv('RUNWAY_API_KEY')}"},
        json={
            "prompt": prompt,
            "duration": duration,
            "model": "gen3a_turbo",
        },
        timeout=120,
    )
    result = response.json()
    return {
        "url": result.get("video_url"),
        "provider": "runway-gen3",
        "duration": duration,
    }

The Prompt Enhancer

Different models respond best to different prompt styles:

# prompt_enhancer.py
from openai import OpenAI

client = OpenAI()

ENHANCER_PROMPTS = {
    "image": """Enhance this image generation prompt. Make it highly descriptive with:
- Specific visual details (lighting, composition, colors)
- Style keywords (photorealistic, cinematic, illustration, etc.)
- Camera/lens details if photographic
- Mood and atmosphere
Keep it under 200 words. Only return the enhanced prompt.""",

    "video": """Enhance this video generation prompt. Include:
- Scene description with specific movements and actions
- Camera motion (pan, zoom, tracking shot, etc.)
- Temporal details (what happens first, then, finally)
- Visual style and mood
Keep it under 150 words. Only return the enhanced prompt.""",

    "speech": """You are preparing text for text-to-speech. Clean up the text:
- Add natural pauses with commas and periods
- Spell out numbers and abbreviations
- Remove markdown formatting
- Add emphasis markers if needed
Only return the cleaned text.""",
}


def enhance_prompt(prompt: str, modality: str) -> str:
    system = ENHANCER_PROMPTS.get(modality, ENHANCER_PROMPTS["image"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

The Orchestrator Agent

# multimodal_agent.py
import json
from openai import OpenAI
from providers import (
    generate_image_dalle, generate_image_flux,
    generate_speech, generate_video_runway,
)
from prompt_enhancer import enhance_prompt

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "generate_image",
            "description": "Generate an image from a text description. Use for any visual content creation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {"type": "string", "description": "Detailed image description"},
                    "provider": {
                        "type": "string",
                        "enum": ["dalle3", "flux"],
                        "description": "dalle3 for photorealistic, flux for artistic/creative",
                    },
                    "size": {"type": "string", "default": "1024x1024"},
                },
                "required": ["prompt", "provider"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_video",
            "description": "Generate a short video clip from a text description.",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {"type": "string", "description": "Video scene description with motion"},
                    "duration": {"type": "integer", "default": 5, "description": "Duration in seconds (5-15)"},
                },
                "required": ["prompt"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_speech",
            "description": "Convert text to natural-sounding speech audio.",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {"type": "string", "description": "Text to convert to speech"},
                    "voice": {
                        "type": "string",
                        "enum": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
                        "default": "nova",
                    },
                },
                "required": ["text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "enhance_prompt",
            "description": "Enhance a generation prompt for better results. Call before generating.",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {"type": "string"},
                    "modality": {"type": "string", "enum": ["image", "video", "speech"]},
                },
                "required": ["prompt", "modality"]
            }
        }
    },
]

SYSTEM_PROMPT = """You are a multi-modal content creation agent. You can generate images, videos, and speech audio.

## Your Process
1. Understand what the user wants to create
2. Enhance the prompt for the target modality
3. Choose the right provider and generate
4. Describe what was created

## Guidelines
- Always enhance prompts before generating for better quality
- For images: use dalle3 for photorealistic, flux for artistic/creative styles
- For video: keep descriptions action-oriented with camera movements
- For speech: clean up text for natural delivery
- If the user wants multiple outputs, generate them in sequence
- Describe the generated content so the user knows what to expect"""


def execute_tool(name: str, args: dict) -> str:
    if name == "generate_image":
        provider = args.get("provider", "dalle3")
        if provider == "dalle3":
            result = generate_image_dalle(args["prompt"], args.get("size", "1024x1024"))
        else:
            result = generate_image_flux(args["prompt"])
        return json.dumps(result)

    elif name == "generate_video":
        result = generate_video_runway(args["prompt"], args.get("duration", 5))
        return json.dumps(result)

    elif name == "generate_speech":
        result = generate_speech(args["text"], args.get("voice", "nova"))
        return json.dumps(result)

    elif name == "enhance_prompt":
        enhanced = enhance_prompt(args["prompt"], args["modality"])
        return json.dumps({"enhanced_prompt": enhanced})

    return json.dumps({"error": f"Unknown tool: {name}"})


class MultiModalAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.generated_assets = []

    def create(self, request: str, max_steps: int = 10) -> dict:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ]
        self.generated_assets = []

        for step in range(max_steps):
            response = client.chat.completions.create(
                model=self.model,
                messages=messages,
                tools=TOOLS,
                tool_choice="auto",
            )
            msg = response.choices[0].message
            messages.append(msg)

            if msg.tool_calls:
                for tc in msg.tool_calls:
                    name = tc.function.name
                    args = json.loads(tc.function.arguments)
                    print(f"  [{step + 1}] {name}: {json.dumps(args)[:80]}...")

                    result = execute_tool(name, args)
                    result_data = json.loads(result)

                    if "url" in result_data or "file_path" in result_data:
                        self.generated_assets.append({
                            "type": name.replace("generate_", ""),
                            **result_data,
                        })

                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": result,
                    })
            else:
                return {
                    "response": msg.content,
                    "assets": self.generated_assets,
                }

        return {
            "response": "Generation complete.",
            "assets": self.generated_assets,
        }


if __name__ == "__main__":
    agent = MultiModalAgent()

    # Example: Generate a complete social media post
    result = agent.create(
        "Create a social media post about AI in healthcare. I need: "
        "1) A hero image showing a futuristic hospital with AI assistance, "
        "2) A short narration audio reading the post caption."
    )

    print(f"\nAgent response: {result['response']}")
    print(f"\nGenerated assets: {len(result['assets'])}")
    for asset in result["assets"]:
        print(f"  - {asset['type']}: {asset.get('url', asset.get('file_path', 'N/A'))}")

Adding a REST API

# server.py
from fastapi import FastAPI
from pydantic import BaseModel
from multimodal_agent import MultiModalAgent

app = FastAPI(title="Multi-modal Generation Agent")
agent = MultiModalAgent()


class GenerateRequest(BaseModel):
    prompt: str
    max_steps: int = 10


@app.post("/generate")
async def generate(req: GenerateRequest):
    result = agent.create(req.prompt, req.max_steps)
    return result

Running It

# Install dependencies
pip install openai httpx replicate fastapi uvicorn

# Set API keys
export OPENAI_API_KEY=sk-your-key
export REPLICATE_API_TOKEN=r8_your-key
export RUNWAY_API_KEY=your-runway-key

# Run
python multimodal_agent.py
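
To serve the REST API instead of the one-off script, start the FastAPI app with uvicorn and call it over HTTP (port and payload are illustrative):

# Run the API server
uvicorn server:app --port 8000

# Call it from another terminal
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A hero image of a futuristic hospital with AI assistants"}'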

Evaluation Metrics Cheat Sheet

Modality | Metric | What It Measures
Image | FID | Distribution quality
Image | CLIP Score | Text-image alignment
Image | Aesthetic Score | Human preference
Video | FVD | Video quality distribution
Video | Temporal Consistency | Frame-to-frame coherence
Video | Motion Quality | Realistic movement
Audio | MOS (Mean Opinion Score) | Human-rated speech quality
Audio | WER (Word Error Rate) | Speech intelligibility
All | Human preference | Side-by-side comparisons

Key Takeaways

  1. Diffusion models dominate image and video generation — they learn to denoise, and generate by starting from pure noise
  2. DiT (Diffusion Transformer) is replacing U-Net as the backbone — it scales better with compute, just like text transformers
  3. Data quality >> model size for T2I — filtering, re-captioning, and aesthetic scoring matter enormously
  4. Video is image generation + temporal coherence — latent diffusion in 3D with massive compute requirements
  5. Multi-modal agents orchestrate, not generate — the LLM chooses which specialist model to call and how to compose the outputs
  6. Prompt enhancement is critical — model-specific prompt optimization dramatically improves output quality
  7. The modality APIs are commoditizing — the value is in orchestration, prompt engineering, and compositing, not in running the models yourself

What’s Next

In the final lesson, you’ll tackle a Capstone Project that combines everything — LLM APIs, RAG, agents, reasoning, and multi-modal generation — into a single production-grade application.