AI Gateway is a proxy service from Vercel designed to work with AI SDK 5. It makes it fast and easy to use a wide range of AI models across providers. Use the AI Gateway provider with any model available in the library below and Vercel handles the rest.
Note: AI Gateway is currently in alpha release and not yet ready for production use. Usage is subject to rate limits based on your Vercel plan, with limits refreshing every 24 hours. Your use of this early preview is subject to Vercel's Public Beta Agreement and AI Policy.
Follow the instructions in one of the demo apps (Next.js Demo, Svelte Demo), or install AI SDK 5 Alpha with the AI Gateway provider module in a fresh project by creating a new Next.js app:
pnpm dlx create-next-app@latest my-ai-app
Then install AI SDK 5 Alpha and the AI Gateway provider:
pnpm add ai@alpha @vercel/ai-sdk-gateway
Your Vercel project has an OIDC token associated with it. This token is used for authenticating with the AI Gateway service. During local development, run your dev server with the following command for auto-refreshing authentication tokens:
vc dev
Alternatively, you can run your dev server as you normally do and just pull your environment variables when the token expires (every 12 hours).
vc env pull
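If you want to confirm that the token is available during local development, a minimal sanity check might look like this (this assumes the token is exposed to your process as the VERCEL_OIDC_TOKEN environment variable, which is how Vercel surfaces OIDC tokens):

// Hypothetical sanity check: warn if the OIDC token is missing,
// e.g. because vc env pull has not been run or the token has expired.
if (!process.env.VERCEL_OIDC_TOKEN) {
  console.warn(
    "VERCEL_OIDC_TOKEN is not set; AI Gateway requests will fail to authenticate."
  );
}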
In a Vercel deployment, the project's OIDC token is automatically available. Just deploy your project as normal and you can use your desired AI models.
You can use the AI Gateway provider in your app as you would any other AI SDK provider. Simple streaming looks something like:
import { gateway } from "@vercel/ai-sdk-gateway";
import { streamText } from "ai";

const result = streamText({
  model: gateway("xai/grok-3-beta"),
  prompt: "Tell me the history of the San Francisco Mission-style burrito.",
  onError: (error: unknown) => {
    console.error(error);
  },
});
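Once you have the result, you can read the response as it arrives. Here is a minimal sketch, assuming the textStream async iterable that streamText results expose in recent AI SDK releases (the exact shape may differ in the 5 Alpha):

// Stream the generated text to stdout as chunks arrive.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart);
}

To target a different model, swap the model string for any entry in the library below, for example gateway("anthropic/claude-3.7-sonnet").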
We'd love to hear your feedback with the Feedback button above, or on X.
anthropic/claude-v3-opus
Claude 3 Opus is Anthropic's most intelligent model, with best-in-market performance on highly complex tasks. It can navigate open-ended prompts and sight-unseen scenarios with remarkable fluency and human-like understanding. Opus shows us the outer limits of what's possible with generative AI.
anthropic/claude-v3.5-sonnet
Claude 3.5 Sonnet strikes the ideal balance between intelligence and speed—particularly for enterprise workloads. It delivers strong performance at a lower cost compared to its peers, and is engineered for high endurance in large-scale AI deployments.
anthropic/claude-v3-haiku
Claude 3 Haiku is Anthropic's fastest model yet, designed for enterprise workloads which often involve longer prompts. Use Haiku to quickly analyze large volumes of documents, such as quarterly filings, contracts, or legal cases, for half the cost of other models in its performance tier.
anthropic/claude-3.5-haiku
Claude 3.5 Haiku is the next generation of our fastest model. For a similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses Claude 3 Opus, the largest model in our previous generation, on many intelligence benchmarks.
anthropic/claude-3.7-sonnet
Claude 3.7 Sonnet is the first hybrid reasoning model and Anthropic's most intelligent model to date. It delivers state-of-the-art performance for coding, content generation, data analysis, and planning tasks, building upon its predecessor Claude 3.5 Sonnet's capabilities in software engineering and computer use.
anthropic/claude-3.7-sonnet-reasoning
Claude 3.7 Sonnet is the first hybrid reasoning model and Anthropic's most intelligent model to date. This variant runs the model with extended thinking enabled, producing step-by-step reasoning before the final answer, and delivers state-of-the-art performance for coding, content generation, data analysis, and planning tasks.
bedrock/amazon.nova-micro-v1:0
bedrock/claude-3-7-sonnet-20250219
Claude 3.7 Sonnet is Anthropic's most intelligent model to date and the first Claude model to offer extended thinking—the ability to solve complex problems with careful, step-by-step reasoning. Anthropic is the first AI lab to introduce a single model where users can balance speed and quality by choosing between standard thinking for near-instant responses or extended thinking for advanced reasoning. Claude 3.7 Sonnet is state-of-the-art for coding, and delivers advancements in computer use, agentic capabilities, complex reasoning, and content generation. With frontier performance and more control over speed, Claude 3.7 Sonnet is the ideal choice for powering AI agents, especially customer-facing agents, and complex AI workflows.
bedrock/claude-3-5-haiku-20241022
Claude 3.5 Haiku is the next generation of Anthropic's fastest model. For a similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses Claude 3 Opus, the largest model in Anthropic's previous generation, on many intelligence benchmarks.
bedrock/claude-3-5-sonnet-20241022-v2
The upgraded Claude 3.5 Sonnet is now state-of-the-art for a variety of tasks including real-world software engineering, agentic capabilities and computer use. The new Claude 3.5 Sonnet delivers these advancements at the same price and speed as its predecessor.
bedrock/claude-3-5-sonnet-20240620-v1
bedrock/claude-3-haiku-20240307-v1
Claude 3 Haiku is Anthropic's fastest, most compact model for near-instant responsiveness. It answers simple queries and requests with speed. Customers will be able to build seamless AI experiences that mimic human interactions. Claude 3 Haiku can process images and return text outputs, and features a 200K context window.
bedrock/meta.llama4-maverick-17b-instruct-v1
bedrock/meta.llama4-scout-17b-instruct-v1
Llama 4 Scout is the best multimodal model in the world in its class and is more powerful than our Llama 3 models, while fitting in a single H100 GPU. Additionally, Llama 4 Scout supports an industry-leading context window of up to 10M tokens.
bedrock/meta.llama3-3-70b-instruct-v1
Where performance meets efficiency. This model supports high-performance conversational AI designed for content creation, enterprise applications, and research, offering advanced language understanding capabilities, including text summarization, classification, sentiment analysis, and code generation.
bedrock/meta.llama3-2-11b-instruct-v1
bedrock/meta.llama3-2-1b-instruct-v1
bedrock/meta.llama3-2-3b-instruct-v1
bedrock/meta.llama3-2-90b-instruct-v1
bedrock/meta.llama3-1-70b-instruct-v1
bedrock/meta.llama3-1-8b-instruct-v1
bedrock/deepseek.r1-v1
cerebras/llama-4-scout-17b-16e-instruct
The Llama-4-Scout-17B-16E-Instruct model is a state-of-the-art, instruction-tuned, multimodal AI model developed by Meta as part of the Llama 4 family. It is designed to handle both text and image inputs, making it suitable for a wide range of applications, including conversational AI, code generation, and visual reasoning.
cerebras/llama3.1-8b
Llama 3.1 8B brings powerful performance in a smaller, more efficient package. With improved multilingual support, tool use, and a 128K context length, it enables sophisticated use cases like interactive agents and compact coding assistants while remaining lightweight and accessible.
cerebras/llama-3.3-70b
The upgraded Llama 3.3 70B model features enhanced reasoning, tool use, and multilingual abilities, along with a significantly expanded 128K context window. These improvements make it well-suited for demanding tasks such as long-form summarization, multilingual conversations, and coding assistance.
cerebras/deepseek-r1-distill-llama-70b
DeepSeek-R1 is a state-of-the-art reasoning model trained with reinforcement learning and cold-start data, delivering strong performance across math, code, and complex reasoning tasks. It offers improved stability, readability, and multilingual handling compared to earlier versions, and is available alongside several high-quality distilled variants.
cerebras/qwen-3-32b
Qwen3-32B is a world-class model with comparable quality to DeepSeek R1 while outperforming GPT-4.1 and Claude Sonnet 3.7. It excels in code-gen, tool-calling, and advanced reasoning, making it an exceptional model for a wide range of production use cases.
Command A is Cohere's most performant model to date, excelling at tool use, agents, retrieval augmented generation (RAG), and multilingual use cases. Command A has a context length of 256K, only requires two GPUs to run, and has 150% higher throughput compared to Command R+ 08-2024.
cohere/command-r-plus
Command R+ is Cohere's newest large language model, optimized for conversational interaction and long-context tasks. It aims to be extremely performant, enabling companies to move beyond proof of concept and into production.
cohere/command-r
Command R is a large language model optimized for conversational interaction and long context tasks. It targets the "scalable" category of models that balance high performance with strong accuracy, enabling companies to move beyond proof of concept and into production.
deepinfra/llama-4-maverick-17b-128e-instruct-fp8
The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17 billion parameter model with 128 experts. Served by DeepInfra.
deepinfra/llama-4-scout-17b-16e-instruct
The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17 billion parameter model with 16 experts. Served by DeepInfra.
deepinfra/qwen3-235b-a22b
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
deepinfra/qwen3-30b-a3b
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
deepseek/chat
DeepSeek-V3 is an open-source mixture-of-experts large language model from DeepSeek that enables versatile functionalities such as text generation, code completion, and more, served by Fireworks AI.
deepseek/deepseek-r1
DeepSeek Reasoner is a specialized model developed by DeepSeek that uses Chain of Thought (CoT) reasoning to improve response accuracy. Before providing a final answer, it generates detailed reasoning steps that are accessible through the API, allowing users to examine and leverage the model's thought process, served by Fireworks AI.
fireworks/firefunction-v1
fireworks/mixtral-8x22b-instruct
fireworks/mixtral-8x7b-instruct
fireworks/qwen3-235b-a22b
groq/llama-3.2-1b
The Llama 3.2 1 billion parameter model is a multilingual, text-only model made by Meta. It is lightweight enough to run on mobile and edge devices. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/llama-3.3-70b-versatile
The Meta Llama 3.3 multilingual model is a pretrained and instruction tuned generative model with 70B parameters. Optimized for multilingual dialogue use cases, it outperforms many of the available open source and closed chat models on common industry benchmarks. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/llama-3.1-8b
Llama 3.1 8B supports a 128K context window, making it ideal for real-time conversational interfaces and data analysis while offering significant cost savings compared to larger models. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/llama-3-8b-instruct
Llama 3 8B is an 8 billion parameter open source model by Meta, fine-tuned for instruction following. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/llama-3-70b-instruct
Llama 3 70B is a 70 billion parameter open source model by Meta, fine-tuned for instruction following. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/gemma2-9b-it
groq/deepseek-r1-distill-llama-70b
DeepSeek-R1-Distill-Llama-70B is a distilled, more efficient variant of the 70B Llama model. It preserves strong performance across text-generation tasks, reducing computational overhead for easier deployment and research. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/mistral-saba-24b
Mistral Saba 24B is a 24 billion parameter open source model by Mistral.ai. Saba is a specialized model trained to excel in Arabic, Farsi, Urdu, Hebrew, and Indic languages. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
Qwen QWQ-32B is a powerful large language model with strong reasoning capabilities and versatile applications across various tasks. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
groq/llama-4-scout-17b-16e-instruct
Llama 4 Scout is Meta's natively multimodal model with a 17B parameter mixture-of-experts architecture (16 experts), offering exceptional performance across text and image understanding with support for 12 languages, optimized for assistant-like chat, image recognition, and coding tasks. Served by Groq with their custom Language Processing Units (LPUs) hardware to provide fast and efficient inference.
inception/mercury-coder-small
mistral/mistral-large
mistral/mistral-small
mistral/codestral-2501
Mistral Codestral 25.01 is a state-of-the-art coding model optimized for low-latency, high-frequency use cases. Proficient in over 80 programming languages, it excels at tasks like fill-in-the-middle (FIM), code correction, and test generation.
mistral/pixtral-12b-2409
mistral/ministral-3b-latest
mistral/ministral-8b-latest
mistral/pixtral-large-latest
Pixtral Large is the second model in our multimodal family and demonstrates frontier-level image understanding. Particularly, the model is able to understand documents, charts and natural images, while maintaining the leading text-only understanding of Mistral Large 2.
mistral/mistral-small-2503
Mistral Small 3.1 is a state-of-the-art multimodal and multilingual model with excellent benchmark performance, delivering inference speeds of 150 tokens per second and supporting a context window of up to 128k tokens.
openai/o3
OpenAI's o3 is their most powerful reasoning model, setting new state-of-the-art benchmarks in coding, math, science, and visual perception. It excels at complex queries requiring multi-faceted analysis, with particular strength in analyzing images, charts, and graphics.
openai/gpt-4.1-mini
openai/gpt-4.1-nano
openai/gpt-4o
GPT-4o from OpenAI has broad general knowledge and domain expertise allowing it to follow complex instructions in natural language and solve difficult problems accurately. It matches GPT-4 Turbo performance with a faster and cheaper API.
openai/gpt-4o-mini
GPT-4o mini from OpenAI is their most advanced and cost-efficient small model. It is multi-modal (accepting text or image inputs and outputting text) and has higher intelligence than gpt-3.5-turbo but is just as fast.
openai/gpt-4-turbo
GPT-4 Turbo from OpenAI has broad general knowledge and domain expertise allowing it to follow complex instructions in natural language and solve difficult problems accurately. It has a knowledge cutoff of April 2023 and a 128,000 token context window.
openai/gpt-3.5-turbo
openai/gpt-3.5-turbo-instruct
perplexity/sonar
perplexity/sonar-pro
perplexity/sonar-reasoning
perplexity/sonar-reasoning-pro
vertex/claude-3-7-sonnet-20250219
vertex/claude-3-5-sonnet-v2-20241022
vertex/claude-3-5-haiku-20241022
vertex/claude-3-opus-20240229
vertex/claude-3-haiku-20240307
vertex/claude-3-5-sonnet-20240620
vertex/gemini-2.0-flash-001
vertex/gemini-2.0-flash-lite-001
vertex/llama-4-scout-17b-16e-instruct-maas
vertex/llama-4-maverick-17b-128e-instruct-maas
xai/grok-2-1212
Grok 2 is a frontier language model with state-of-the-art reasoning capabilities. It features advanced capabilities in chat, coding, and reasoning, outperforming both Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS leaderboard.
xai/grok-2-vision-1212
Grok 2 vision model excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and document-based question answering (DocVQA). It can process a wide variety of visual information including documents, diagrams, charts, screenshots, and photographs.
xai/grok-3-beta
xai/grok-3-fast-beta
xAI's flagship model that excels at enterprise use cases like data extraction, coding, and text summarization. Possesses deep domain knowledge in finance, healthcare, law, and science. The fast model variant is served on faster infrastructure, offering response times that are significantly faster than the standard variant. The increased speed comes at a higher cost per output token.
xai/grok-3-mini-beta
xai/grok-3-mini-fast-beta
xAI's lightweight model that thinks before responding. Great for simple or logic-based tasks that do not require deep domain knowledge. The raw thinking traces are accessible. The fast model variant is served on faster infrastructure, offering response times that are significantly faster than the standard variant. The increased speed comes at a higher cost per output token.