In this guide, you will build a multi-modal agent capable of understanding both images and PDFs.
Multi-modal refers to the ability of the agent to understand and generate responses in multiple formats. In this guide, we'll focus on images and PDFs - two common document types that modern language models can process natively.
For a complete list of providers and their multi-modal capabilities, visit the providers documentation.
We'll build this agent using OpenAI's GPT-4o, but the same code works seamlessly with other providers - you can switch between them by changing just one line of code.
To follow this quickstart, you'll need:
- Node.js and pnpm installed on your local development machine.
- A Vercel AI Gateway API key.
If you haven't obtained your Vercel AI Gateway API key, you can do so by signing up on the Vercel website.
Start by creating a new Next.js application. The command below creates a new directory named multi-modal-agent and sets up a basic Next.js application inside it.
Be sure to select yes when prompted to use the App Router. If you are using the Pages Router instead, see the Next.js Pages Router quickstart guide.
pnpm create next-app@latest multi-modal-agent
Navigate to the newly created directory:
cd multi-modal-agent
Install ai and @ai-sdk/react, the AI SDK package and the AI SDK's React package respectively.
The AI SDK is designed to be a unified interface for interacting with any large language model. This means that you can change models and providers with just one line of code! Learn more about available providers and building custom providers in the providers section.
pnpm add ai @ai-sdk/react
Create a .env.local file in your project root and add your Vercel AI Gateway API key. This key authenticates your application with Vercel AI Gateway.
touch .env.local
Edit the .env.local file:
AI_GATEWAY_API_KEY=your_api_key_here
Replace your_api_key_here with your actual Vercel AI Gateway API key.
The AI SDK's Vercel AI Gateway Provider is the default global provider, so you can access models using a simple string in the model configuration. If you prefer to use a specific provider like OpenAI directly, see the provider management documentation.
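To build a multi-modal agent, you will need to:
- Create a Route Handler to process incoming chat messages and generate responses.
- Wire up the UI to display chat messages, accept user input and file uploads, and submit new messages.
Create a route handler, app/api/chat/route.ts, and add the following code: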
import { streamText, convertToModelMessages, type UIMessage } from 'ai';

// Allow streaming responses up to 30 seconds
export const maxDuration = 30;

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: 'openai/gpt-4o',
    messages: await convertToModelMessages(messages),
  });

  return result.toUIMessageStreamResponse();
}

Let's take a look at what is happening in this code:
1. Define an asynchronous POST request handler and extract messages from the body of the request. The messages variable contains a history of the conversation between you and the agent and provides the agent with the necessary context to make the next generation.
2. Convert the messages with convertToModelMessages, which transforms the UI-focused message format to the format expected by the language model.
3. Call streamText, which is imported from the ai package. This function accepts a configuration object that contains a model provider and messages (converted in step 2). You can pass additional settings to further customize the model's behavior.
4. The streamText function returns a StreamTextResult. This result object contains the toUIMessageStreamResponse function, which converts the result to a streamed response object.

This Route Handler creates a POST request endpoint at /api/chat.
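The handler above trusts whatever JSON arrives in the request body, so it can help to check the shape of messages before converting them. The following is a minimal sketch under stated assumptions: the types TextPart, FilePart, and ChatMessage and the helper isChatMessageArray are hypothetical, simplified stand-ins written for this example (they are not exported by the ai package). Only text and file parts are inspected closely; any other part type is accepted as long as it carries a string type field.

```typescript
// Hypothetical, simplified stand-ins for the message shape used in this guide.
// These are NOT the SDK's own types.
type TextPart = { type: 'text'; text: string };
type FilePart = { type: 'file'; mediaType: string; url: string };
type OtherPart = { type: string };
type ChatMessage = {
  id: string;
  role: 'system' | 'user' | 'assistant';
  parts: Array<TextPart | FilePart | OtherPart>;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Checks text and file parts field by field; other part types only need
// a string `type`, since this sketch does not know their full shape.
function isPart(value: unknown): boolean {
  if (!isRecord(value) || typeof value.type !== 'string') return false;
  if (value.type === 'text') return typeof value.text === 'string';
  if (value.type === 'file') {
    return typeof value.mediaType === 'string' && typeof value.url === 'string';
  }
  return true;
}

function isChatMessageArray(value: unknown): value is ChatMessage[] {
  if (!Array.isArray(value)) return false;
  return value.every(
    m =>
      isRecord(m) &&
      typeof m.id === 'string' &&
      (m.role === 'system' || m.role === 'user' || m.role === 'assistant') &&
      Array.isArray(m.parts) &&
      m.parts.every(isPart),
  );
}
```

In the route handler, a failed check could return early with something like new Response('Invalid messages', { status: 400 }) before streamText is called.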
Now that you have a Route Handler that can query a large language model (LLM), it's time to set up your frontend. AI SDK UI abstracts the complexity of a chat interface into one hook, useChat.
Update your root page (app/page.tsx) with the following code to show a list of chat messages and provide a user message input:
'use client';

import { useChat } from '@ai-sdk/react';
import { DefaultChatTransport } from 'ai';
import { useState } from 'react';

export default function Chat() {
  const [input, setInput] = useState('');

  const { messages, sendMessage } = useChat({
    transport: new DefaultChatTransport({
      api: '/api/chat',
    }),
  });

  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.map(m => (
        <div key={m.id} className="whitespace-pre-wrap">
          {m.role === 'user' ? 'User: ' : 'AI: '}
          {m.parts.map((part, index) => {
            if (part.type === 'text') {
              return <span key={`${m.id}-text-${index}`}>{part.text}</span>;
            }
            return null;
          })}
        </div>
      ))}

      <form
        onSubmit={async event => {
          event.preventDefault();
          sendMessage({
            role: 'user',
            parts: [{ type: 'text', text: input }],
          });
          setInput('');
        }}
        className="fixed bottom-0 w-full max-w-md mb-8 border border-gray-300 rounded shadow-xl"
      >
        <input
          className="w-full p-2"
          value={input}
          placeholder="Say something..."
          onChange={e => setInput(e.target.value)}
        />
      </form>
    </div>
  );
}

Make sure you add the "use client" directive to the top of your file. This allows you to add interactivity with JavaScript.
This page utilizes the useChat hook, configured with DefaultChatTransport to specify the API endpoint. The useChat hook provides multiple utility functions and state variables:
- messages: the current chat messages (an array of objects with id, role, and parts properties).
- sendMessage: a function to send a new message to the AI.
- Each message contains a parts array that can include text, images, PDFs, and other content types.

To make your agent multi-modal, let's add the ability to upload and send both images and PDFs to the model. In v5, files are sent as part of the message's parts array. Files are converted to data URLs using the FileReader API before being sent to the server.
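A data URL embeds a file's bytes directly in a string of the form data:<mediaType>;base64,<payload>, which is the kind of string FileReader.readAsDataURL produces in the browser. To make that format concrete, the sketch below builds the same kind of string from raw bytes. The names base64Encode and toDataURL are hypothetical and exist only for this example; in a browser you would normally rely on FileReader (or btoa) rather than encoding by hand, which is done here only so the sketch runs in any JavaScript runtime.

```typescript
const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encodes bytes as standard base64, three input bytes per four output characters.
function base64Encode(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasSecond ? bytes[i + 1] : 0;
    const b2 = hasThird ? bytes[i + 2] : 0;

    out += BASE64_ALPHABET.charAt(b0 >> 2);
    out += BASE64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    // Missing input bytes are represented by '=' padding.
    out += hasSecond
      ? BASE64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6))
      : '=';
    out += hasThird ? BASE64_ALPHABET.charAt(b2 & 0x3f) : '=';
  }
  return out;
}

// Builds a data URL with the same layout as FileReader.readAsDataURL output.
function toDataURL(bytes: Uint8Array, mediaType: string): string {
  return 'data:' + mediaType + ';base64,' + base64Encode(bytes);
}

// The two bytes of the ASCII string "hi".
const exampleUrl = toDataURL(new Uint8Array([104, 105]), 'text/plain');
// exampleUrl is 'data:text/plain;base64,aGk='
```

The mediaType portion is what the page component later inspects to decide how an attachment is rendered.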
Update your root page (app/page.tsx) with the following code:
'use client';

import { useChat } from '@ai-sdk/react';
import { DefaultChatTransport } from 'ai';
import { useRef, useState } from 'react';
import Image from 'next/image';

async function convertFilesToDataURLs(files: FileList) {
  return Promise.all(
    Array.from(files).map(
      file =>
        new Promise<{
          type: 'file';
          mediaType: string;
          url: string;
        }>((resolve, reject) => {
          const reader = new FileReader();
          reader.onload = () => {
            resolve({
              type: 'file',
              mediaType: file.type,
              url: reader.result as string,
            });
          };
          reader.onerror = reject;
          reader.readAsDataURL(file);
        }),
    ),
  );
}

export default function Chat() {
  const [input, setInput] = useState('');
  const [files, setFiles] = useState<FileList | undefined>(undefined);
  const fileInputRef = useRef<HTMLInputElement>(null);

  const { messages, sendMessage } = useChat({
    transport: new DefaultChatTransport({
      api: '/api/chat',
    }),
  });

  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.map(m => (
        <div key={m.id} className="whitespace-pre-wrap">
          {m.role === 'user' ? 'User: ' : 'AI: '}
          {m.parts.map((part, index) => {
            if (part.type === 'text') {
              return <span key={`${m.id}-text-${index}`}>{part.text}</span>;
            }
            if (part.type === 'file' && part.mediaType?.startsWith('image/')) {
              return (
                <Image
                  key={`${m.id}-image-${index}`}
                  src={part.url}
                  width={500}
                  height={500}
                  alt={`attachment-${index}`}
                />
              );
            }
            if (part.type === 'file' && part.mediaType === 'application/pdf') {
              return (
                <iframe
                  key={`${m.id}-pdf-${index}`}
                  src={part.url}
                  width={500}
                  height={600}
                  title={`pdf-${index}`}
                />
              );
            }
            return null;
          })}
        </div>
      ))}

      <form
        className="fixed bottom-0 w-full max-w-md p-2 mb-8 border border-gray-300 rounded shadow-xl space-y-2"
        onSubmit={async event => {
          event.preventDefault();

          const fileParts =
            files && files.length > 0
              ? await convertFilesToDataURLs(files)
              : [];

          sendMessage({
            role: 'user',
            parts: [{ type: 'text', text: input }, ...fileParts],
          });

          setInput('');
          setFiles(undefined);

          if (fileInputRef.current) {
            fileInputRef.current.value = '';
          }
        }}
      >
        <input
          type="file"
          accept="image/*,application/pdf"
          className=""
          onChange={event => {
            if (event.target.files) {
              setFiles(event.target.files);
            }
          }}
          multiple
          ref={fileInputRef}
        />
        <input
          className="w-full p-2"
          value={input}
          placeholder="Say something..."
          onChange={e => setInput(e.target.value)}
        />
      </form>
    </div>
  );
}

In this code, you:
- Add a helper function, convertFilesToDataURLs, to convert file uploads to data URLs.
- Use useChat with DefaultChatTransport to specify the API endpoint.
- Render messages using the parts array structure, rendering text, images, and PDFs appropriately.
- Update the onSubmit function to send messages with the sendMessage function, including both text and file parts.
- Add a file input with an onChange handler to handle updating the files state.

With that, you have built everything you need for your multi-modal agent! To start your application, use the command:
pnpm run dev
Head to your browser and open http://localhost:3000. You should see an input field and a button to upload files.
Try uploading an image or PDF and asking the model questions about it. Watch as the model's response is streamed back to you!
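If an attachment does not show up in the conversation, one thing to check is whether its media type matches either rendering branch in the page component. That branching can be isolated into a small helper, sketched below. The names AttachmentKind and classifyFilePart are hypothetical and exist only for this example; the checks are intended to mirror the conditions the page uses for images and PDFs.

```typescript
type AttachmentKind = 'image' | 'pdf' | 'unsupported';

// Maps a file part's media type to the way the page would render it.
function classifyFilePart(mediaType: string | undefined): AttachmentKind {
  if (mediaType === undefined) return 'unsupported';
  // Same check as mediaType.startsWith('image/'), written with indexOf
  // so the sketch also compiles against older library targets.
  if (mediaType.indexOf('image/') === 0) return 'image';
  if (mediaType === 'application/pdf') return 'pdf';
  return 'unsupported';
}

const kinds = ['image/png', 'application/pdf', 'text/plain'].map(classifyFilePart);
// kinds is ['image', 'pdf', 'unsupported']
```

A helper like this could also be used to show a fallback message for unsupported attachments instead of rendering nothing.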
With the AI SDK's unified provider interface you can easily switch to other providers that support multi-modal capabilities:
// Using Anthropic
const result = streamText({
  model: 'anthropic/claude-sonnet-4-20250514',
  messages: await convertToModelMessages(messages),
});

// Using Google
const result = streamText({
  model: 'google/gemini-2.5-flash',
  messages: await convertToModelMessages(messages),
});

Install the provider package (@ai-sdk/anthropic or @ai-sdk/google) and update your API keys in .env.local. The rest of your code remains the same.
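Since the model is selected with a plain string, one way to keep the switch to a single line is to collect the model IDs in one place. The sketch below uses only the three model strings that appear in this guide; the MODEL_IDS table, the ProviderName type, the modelIdFor function, and the ACTIVE_PROVIDER constant are hypothetical names introduced for illustration and are not part of the AI SDK.

```typescript
// Model ID strings taken from this guide; the table itself is an assumption.
const MODEL_IDS = {
  openai: 'openai/gpt-4o',
  anthropic: 'anthropic/claude-sonnet-4-20250514',
  google: 'google/gemini-2.5-flash',
} as const;

type ProviderName = keyof typeof MODEL_IDS;

function modelIdFor(provider: ProviderName): string {
  return MODEL_IDS[provider];
}

// Change this one value to switch providers.
const ACTIVE_PROVIDER: ProviderName = 'openai';
const activeModelId = modelIdFor(ACTIVE_PROVIDER);
```

The resulting string could then be passed as the model option in the route handler, in place of the hard-coded 'openai/gpt-4o'.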
Different providers may have varying file size limits and performance characteristics. Check the provider documentation for specific details.
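Because files travel as base64 data URLs, the string sent to the server is roughly a third larger than the file itself, which is worth keeping in mind when comparing against a provider's limits. The sketch below estimates the data URL length for a file and compares it with a limit. The function names and the limit value are placeholders chosen for this example, not figures documented by any provider.

```typescript
// Estimates the length of a base64 data URL for a file of the given size.
function estimateDataUrlLength(byteLength: number, mediaType: string): number {
  const prefixLength = 'data:'.length + mediaType.length + ';base64,'.length;
  // Base64 emits 4 characters for every 3 input bytes, padded to a full group.
  const base64Length = 4 * Math.ceil(byteLength / 3);
  return prefixLength + base64Length;
}

// Placeholder limit for illustration only; check your provider's documentation.
const HYPOTHETICAL_MAX_CHARS = 4 * 1024 * 1024;

function fitsWithinLimit(
  byteLength: number,
  mediaType: string,
  maxChars: number = HYPOTHETICAL_MAX_CHARS,
): boolean {
  return estimateDataUrlLength(byteLength, mediaType) <= maxChars;
}

const smallImageFits = fitsWithinLimit(1024, 'image/png'); // true
const threeMegabytePdfFits = fitsWithinLimit(3 * 1024 * 1024, 'application/pdf'); // false
```

In the page component, file.size could supply byteLength, so oversized files can be rejected before they are read and sent.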
You've built a multi-modal AI agent using the AI SDK! Experiment and extend the functionality of this application further by exploring tool calling.