Open AI Models Explained

Models

Flagship models

Model	Description	Input/Output	Context Length	Key Feature
GPT-4o	Versatile, high-intelligence flagship model	Text and image input, text output	128k tokens	Smarter model, higher price per token
GPT-4o mini	Fast, affordable small model for focused tasks	Text and image input, text output	128k tokens	Faster model, lower price per token
o1 & o1-mini	Reasoning models for complex tasks	Text and image input, text output	128k tokens	Uses additional tokens for reasoning

Models overview

The OpenAI API is powered by a diverse set of models with different capabilities and price points. You can also make customizations to our models for your specific use case with fine-tuning.

Model	Description
GPT-4o	Our versatile, high-intelligence flagship model
GPT-4o-mini	Our fast, affordable small model for focused tasks
o1 and o1-mini	Reasoning models that excel at complex, multi-step tasks
GPT-4o Realtime	GPT-4o models capable of realtime text and audio inputs and outputs
GPT-4o Audio	GPT-4o models capable of audio inputs and outputs via REST API
GPT-4 Turbo and GPT-4	The previous set of high-intelligence models
GPT-3.5 Turbo	A fast model for simple tasks, superceded by GPT-4o-mini
DALL·E	A model that can generate and edit images given a natural language prompt
TTS	A set of models that can convert text into natural sounding spoken audio
Whisper	A model that can convert audio into text
Embeddings	A set of models that can convert text into a numerical form
Moderation	A fine-tuned model that can detect whether text may be sensitive or unsafe
Deprecated	A full list of models that have been deprecated along with the suggested replacement

We have also published open source models including Point-E, Whisper, Jukebox, and CLIP.

Context window

Models on this page will list a context window, which refers to the maximum number of tokens that can be used in a single request, inclusive of both input, output, and reasoning tokens. For example, when making an API request to chat completions with the o1 model, the following token counts will apply toward the context window total:

Input tokens (inputs you include in the messages array with chat completions)

Output tokens (tokens generated in response to your prompt)

Reasoning tokens (used by the model to plan a response)

Tokens generated in excess of the context window limit may be truncated in API responses.

You can estimate the number of tokens your messages will use with the tokenizer tool.

Model ID aliases and snapshots

In the tables below, you will see model IDs that can be used in REST APIs like chat completions to generate outputs. Some of these model IDs are aliases which point to specific dated snapshots.

For example, the gpt-4o model ID is an alias that points to a specific dated snapshot of GPT-4o. The dated snapshots that these aliases point to are periodically updated to newer snapshots a few months after a newer snapshot becomes available. Model IDs that are aliases note the model ID they currently point to in the tables below.

API request using a model alias

In API requests where an alias was used as a model ID, the body of the response will contain the actual model ID used to generate the response.

Current model aliases

Below, please find current model aliases, and guidance on when they will be updated to new versions (if guidance is available).

|| |gpt-4o|gpt-4o-2024-08-06| |chatgpt-4o-latest|Latest used in ChatGPT| |gpt-4o-mini|gpt-4o-mini-2024-07-18| |o1|o1-2024-12-17| |o1-mini|o1-mini-2024-09-12| |o1-preview|o1-preview-2024-09-12| |gpt-4o-realtime-preview|gpt-4o-realtime-preview-2024-12-17| |gpt-4o-mini-realtime-preview|gpt-4o-mini-realtime-preview-2024-12-17| |gpt-4o-audio-preview|gpt-4o-audio-preview-2024-12-17|

In production applications, it is a best practice to use dated model snapshot IDs instead of aliases, which may change periodically.

GPT-4o

GPT-4o ("o" for "omni") is our versatile, high-intelligence flagship model. It accepts both text and image inputs, and produces text outputs (including Structured Outputs). Learn how to use GPT-4o in our text generation guide.

The chatgpt-4o-latest model ID below continuously points to the version of GPT-4o used in ChatGPT. It is updated frequently, when there are significant changes to ChatGPT's GPT-4o model.

The knowledge cutoff for GPT-4o models is October, 2023.

|| |gpt-4o↳ gpt-4o-2024-08-06|128,000 tokens|16,384 tokens| |gpt-4o-2024-11-20|128,000 tokens|16,384 tokens| |gpt-4o-2024-08-06|128,000 tokens|16,384 tokens| |gpt-4o-2024-05-13|128,000 tokens|4,096 tokens| |chatgpt-4o-latest↳ GPT-4o used in ChatGPT|128,000 tokens|16,384 tokens|

GPT-4o mini

GPT-4o mini ("o" for "omni") is a fast, affordable small model for focused tasks. It accepts both text and image inputs, and produces text outputs (including Structured Outputs). It is ideal for fine-tuning, and model outputs from a larger model like GPT-4o can be distilled to GPT-4o-mini to produce similar results at lower cost and latency.

The knowledge cutoff for GPT-4o-mini models is October, 2023.

|| |gpt-4o-mini↳ gpt-4o-mini-2024-07-18|128,000 tokens|16,384 tokens| |gpt-4o-mini-2024-07-18|128,000 tokens|16,384 tokens|

o1 and o1-mini

The o1 series of models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user. Learn about the capabilities of o1 models in our reasoning guide.

There are two model types available today:

o1: reasoning model designed to solve hard problems across domains

o1-mini: fast and affordable reasoning model for specialized tasks

The latest o1 model supports both text and image inputs, and produces text outputs (including Structured Outputs). o1-mini currently only supports text inputs and outputs.

The knowledge cutoff for o1 and o1-mini models is October, 2023.

|| |o1↳ o1-2024-12-17|200,000 tokens|100,000 tokens| |o1-2024-12-17|200,000 tokens|100,000 tokens| |o1-mini↳ o1-mini-2024-09-12|128,000 tokens|65,536 tokens| |o1-mini-2024-09-12|128,000 tokens|65,536 tokens| |o1-preview↳ o1-preview-2024-09-12|128,000 tokens|32,768 tokens| |o1-preview-2024-09-12|128,000 tokens|32,768 tokens|

GPT-4o and GPT-4o-mini Realtime

Beta

This is a preview release of the GPT-4o and GPT-4o-mini Realtime models. These models are capable of responding to audio and text inputs in realtime over WebRTC or a WebSocket interface. Learn more in the Realtime API guide.

The knowledge cutoff for GPT-4o Realtime models is October, 2023.

|| |gpt-4o-realtime-preview↳ gpt-4o-realtime-preview-2024-12-17|128,000 tokens|4,096 tokens| |gpt-4o-realtime-preview-2024-12-17|128,000 tokens|4,096 tokens| |gpt-4o-realtime-preview-2024-10-01|128,000 tokens|4,096 tokens| |gpt-4o-mini-realtime-preview↳ gpt-4o-mini-realtime-preview-2024-12-17|128,000 tokens|4,096 tokens| |gpt-4o-mini-realtime-preview-2024-12-17|128,000 tokens|4,096 tokens|

GPT-4o and GPT-4o-mini Audio

Beta

This is a preview release of the GPT-4o Audio models. These models accept audio inputs and outputs, and can be used in the Chat Completions REST API. Learn more.

The knowledge cutoff for GPT-4o Audio models is October, 2023.

|| |gpt-4o-audio-preview↳ gpt-4o-audio-preview-2024-12-17|128,000 tokens|16,384 tokens| |gpt-4o-audio-preview-2024-12-17|128,000 tokens|16,384 tokens| |gpt-4o-audio-preview-2024-10-01|128,000 tokens|16,384 tokens| |gpt-4o-mini-audio-preview↳ gpt-4o-mini-audio-preview-2024-12-17|128,000 tokens|16,384 tokens| |gpt-4o-mini-audio-preview-2024-12-17|128,000 tokens|16,384 tokens|

GPT-4 Turbo and GPT-4

GPT-4 is an older version of a high-intelligence GPT model, usable in Chat Completions. Learn more in the text generation guide. The knowledge cutoff for the latest GPT-4 Turbo version is December, 2023.

|| |gpt-4-turbo↳ gpt-4-turbo-2024-04-09|128,000 tokens|4,096 tokens| |gpt-4-turbo-2024-04-09|128,000 tokens|4,096 tokens| |gpt-4-turbo-preview↳ gpt-4-0125-preview|128,000 tokens|4,096 tokens| |gpt-4-0125-preview|128,000 tokens|4,096 tokens| |gpt-4-1106-preview|128,000 tokens|4,096 tokens| |gpt-4↳ gpt-4-0613|8,192 tokens|8,192 tokens| |gpt-4-0613|8,192 tokens|8,192 tokens| |gpt-4-0314|8,192 tokens|8,192 tokens|

GPT-3.5 Turbo

GPT-3.5 Turbo models can understand and generate natural language or code and have been optimized for chat using the Chat Completions API but work well for non-chat tasks as well.

As of July 2024, gpt-4o-mini should be used in place of gpt-3.5-turbo, as it is cheaper, more capable, multimodal, and just as fast. gpt-3.5-turbo is still available for use in the API.

Model	Context window	Max output tokens	Knowledge cutoff
gpt-3.5-turbo-0125The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls. Learn more.	16,385 tokens	4,096 tokens	Sep 2021
gpt-3.5-turboCurrently points to gpt-3.5-turbo-0125.	16,385 tokens	4,096 tokens	Sep 2021
gpt-3.5-turbo-1106GPT-3.5 Turbo model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Learn more.	16,385 tokens	4,096 tokens	Sep 2021
gpt-3.5-turbo-instructSimilar capabilities as GPT-3 era models. Compatible with legacy Completions endpoint and not Chat Completions.	4,096 tokens	4,096 tokens	Sep 2021

DALL·E

DALL·E is a AI system that can create realistic images and art from a description in natural language. DALL·E 3 currently supports the ability, given a prompt, to create a new image with a specific size. DALL·E 2 also support the ability to edit an existing image, or create variations of a user provided image.

DALL·E 3 is available through our Images API along with DALL·E 2. You can try DALL·E 3 through ChatGPT Plus.

Model	Description
dall-e-3	The latest DALL·E model released in Nov 2023. Learn more.
dall-e-2	The previous DALL·E model released in Nov 2022. The 2nd iteration of DALL·E with more realistic, accurate, and 4x greater resolution images than the original model.

TTS

TTS is an AI model that converts text to natural sounding spoken text. We offer two different model variates, tts-1 is optimized for real time text to speech use cases and tts-1-hd is optimized for quality. These models can be used with the Speech endpoint in the Audio API.

Model	Description
tts-1	The latest text to speech model, optimized for speed.
tts-1-hd	The latest text to speech model, optimized for quality.

Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification. The Whisper v2-large model is currently available through our API with the whisper-1 model name.

Currently, there is no difference between the open source version of Whisper and the version available through our API. However, through our API, we offer an optimized inference process which makes running Whisper through our API much faster than doing it through other means. For more technical details on Whisper, you can read the paper.

Embeddings

Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text. Embeddings are useful for search, clustering, recommendations, anomaly detection, and classification tasks. You can read more about our latest embedding models in the announcement blog post.

Model	Output Dimension
text-embedding-3-largeMost capable embedding model for both english and non-english tasks	3,072
text-embedding-3-smallIncreased performance over 2nd generation ada embedding model	1,536
text-embedding-ada-002Most capable 2nd generation embedding model, replacing 16 first generation models	1,536

Moderation

The Moderation models are designed to check whether content complies with OpenAI's usage policies. The models provide classification capabilities that look for content in categories like hate, self-harm, sexual content, violence, and others. Learn more about moderating text and images in our moderation guide.

Model	Max tokens	ㅤ
omni-moderation-latestCurrently points to omni-moderation-2024-09-26.	32,768	ㅤ
omni-moderation-2024-09-26Latest pinned version of our new multi-modal moderation model, capable of analyzing both text and images.	32,768	ㅤ
text-moderation-latestCurrently points to text-moderation-007.	32,768	ㅤ
text-moderation-stableCurrently points to text-moderation-007.	32,768	ㅤ
text-moderation-007Previous generation text-only moderation. We expect omni-moderation-* models to be the best default moving forward.	32,768	ㅤ

GPT base

GPT base models can understand and generate natural language or code but are not trained with instruction following. These models are made to be replacements for our original GPT-3 base models and use the legacy Completions API. Most customers should use GPT-3.5 or GPT-4.

Model	Max tokens	Knowledge cutoff
babbage-002Replacement for the GPT-3 ada and babbage base models.	16,384 tokens	Sep 2021
davinci-002Replacement for the GPT-3 curie and davinci base models.	16,384 tokens	Sep 2021

How we use your data

Your data is your data.

As of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models (unless you explicitly opt in to share data with us).

To help identify abuse, API data may be retained for up to 30 days, after which it will be deleted (unless otherwise required by law). For trusted customers with sensitive applications, zero data retention may be available. With zero data retention, request and response bodies are not persisted to any logging mechanism and exist only in memory in order to serve the request.

Note that this data policy does not apply to OpenAI's non-API consumer services like ChatGPT or DALL·E Labs.

Default usage policies by endpoint

Endpoint	Data used for training	Default retention	Eligible for zero retention
/v1/chat/completions*	No	30 days	Yes, except (a) image inputs, (b) schemas provided for Structured Outputs, or (c) audio outputs. *
/v1/assistants	No	30 days **	No
/v1/threads	No	30 days **	No
/v1/threads/messages	No	30 days **	No
/v1/threads/runs	No	30 days **	No
/v1/vector_stores	No	30 days **	No
/v1/threads/runs/steps	No	30 days **	No
/v1/images/generations	No	30 days	No
/v1/images/edits	No	30 days	No
/v1/images/variations	No	30 days	No
/v1/embeddings	No	30 days	Yes
/v1/audio/transcriptions	No	Zero data retention	-
/v1/audio/translations	No	Zero data retention	-
/v1/audio/speech	No	30 days	Yes
/v1/files	No	Until deleted by customer	No
/v1/fine_tuning/jobs	No	Until deleted by customer	No
/v1/batches	No	Until deleted by customer	No
/v1/moderations	No	Zero data retention	-
/v1/completions	No	30 days	Yes
/v1/realtime (beta)	No	30 days	Yes

\ Chat Completions:*

Image inputs via the o1, gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention.

Audio outputs are stored for 1 hour to enable multi-turn conversations, and are not currently eligible for zero retention.

When Structured Outputs is enabled, schemas provided (either as the response_format or in the function definition) are not eligible for zero retention, though the completions themselves are.

When using Stored Completions via the store: true option in the API, those completions are stored for 30 days. Completions are stored in an unfiltered form after an API response, so please avoid storing completions that contain sensitive data.

\\ Assistants API:

Objects related to the Assistants API are deleted from our servers 30 days after you delete them via the API or the dashboard. Objects that are not deleted via the API or dashboard are retained indefinitely.

Evaluations:

Evaluation data: When you create an evaluation, the data related to that evaluation is deleted from our servers 30 days after you delete it via the dashboard. Evaluation data that is not deleted via the dashboard is retained indefinitely.

For details, see our API data usage policies. To learn more about zero retention, get in touch with our sales team.

Model endpoint compatibility

Endpoint	Latest models
/v1/assistants	All GPT-4o (except chatgpt-4o-latest), GPT-4o-mini, GPT-4, and GPT-3.5 Turbo models. The retrieval tool requires gpt-4-turbo-preview (and subsequent dated model releases) or gpt-3.5-turbo-1106 (and subsequent versions).
/v1/audio/transcriptions	whisper-1
/v1/audio/translations	whisper-1
/v1/audio/speech	tts-1, tts-1-hd
/v1/chat/completions	All GPT-4o (except for Realtime preview), GPT-4o-mini, GPT-4, and GPT-3.5 Turbo models and their dated releases. chatgpt-4o-latest dynamic model. Fine-tuned versions of gpt-4o, gpt-4o-mini, gpt-4, and gpt-3.5-turbo.
/v1/completions (Legacy)	gpt-3.5-turbo-instruct, babbage-002, davinci-002
/v1/embeddings	text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
/v1/fine_tuning/jobs	gpt-4o, gpt-4o-mini, gpt-4, gpt-3.5-turbo
/v1/moderations	text-moderation-stable, text-moderation-latest
/v1/images/generations	dall-e-2, dall-e-3
/v1/realtime (beta)	gpt-4o-realtime-preview, gpt-4o-realtime-preview-2024-10-01

Text generation

Learn how to generate text from a prompt.

OpenAI provides simple APIs to use a large language model to generate text from a prompt, as you might using ChatGPT. These models have been trained on vast quantities of data to understand multimedia inputs and natural language instructions. From these prompts, models can generate almost any kind of text response, like code, mathematical equations, structured JSON data, or human-like prose.

Quickstart

To generate text, you can use the chat completions endpoint in the REST API, as seen in the examples below. You can either use the REST API from the HTTP client of your choice, or use one of OpenAI's official SDKs for your preferred programming language.

Generate prose

Create a human-like response to a prompt

Analyze an image

Describe the contents of an image

Generate JSON data

Generate JSON data based on a JSON Schema

Choosing a model

When making a text generation request, your first decision is which model you want to generate the response. The model you choose influences output and impacts cost.

A large model like gpt-4o offers a very high level of intelligence and strong performance, with higher cost per token.

A small model like gpt-4o-mini offers intelligence not quite on the level of the larger model, but it's faster and less expensive per token.

A reasoning model like the o1 family of models is slower to return a result, and uses more tokens to "think," but is capable of advanced reasoning, coding, and multi-step planning.

Experiment with different models in the Playground to see which works best for your prompts! You might also benefit from our model selection best practices.

Building prompts

The process of crafting prompts to get the right output from a model is called prompt engineering. You can improve output by giving the model precise instructions, examples, and necessary context information—like private or specialized information not included in the model's training data.

Below is high-level guidance on building prompts. For more in-depth strategies and tactics, see the prompt engineering guide.

Messages and roles

In the chat completions API, you create prompts by providing an array of messages that contain instructions for the model. Each message can have a different role, which influences how the model might interpret the input.

|| |user|Instructions that request some output from the model. Similar to messages you'd type in ChatGPT as an end user.|Pass your end-user's message to the model.Write a haiku about programming.| |developer|Instructions to the model that are prioritized ahead of user messages, following chain of command. Previously called the system prompt.|Describe how the model should generally behave and respond.You are a helpful assistant that answers programming questions in the style of a southern belle from the southeast United States.Now, any response to a user message should have a southern belle personality and tone.| |assistant|A message generated by the model, perhaps in a previous generation request (see the "Conversations" section below).|Provide examples to the model for how it should respond to the current request.For example, to get a model to respond correctly to knock-knock jokes, you might provide a full back-and-forth dialogue of a knock-knock joke.|

Message roles may help you get better responses, especially if you want a model to follow hierarchical instructions. They're not deterministic, so the best way to use them is just trying things and seeing what gives you good results.

Here's an example of a developer message that modifies the behavior of the model when generating a response to a user message:

This prompt returns a text output in the rhetorical style requested:

Giving the model additional data to use for generation

You can also use the message types above to provide additional information to the model, outside of its training data. You might want to include the results of a database query, a text document, or other resources to help the model generate a relevant response. This technique is often referred to as retrieval augmented generation, or RAG. Learn more about RAG techniques.

Conversations and context

While each text generation request is independent and stateless (unless you're using assistants), you can still implement multi-turn conversations by providing additional messages as parameters to your text generation request. Consider a "knock knock" joke:

By using alternating user and assistant messages, you capture the previous state of a conversation in one request to the model.

Managing context for text generation

As your inputs become more complex, or you include more turns in a conversation, you'll need to consider both output token and context window limits. Model inputs and outputs are metered in tokens, which are parsed from inputs to analyze their content and intent and assembled to render logical outputs. Models have limits on token usage during the lifecycle of a text generation request.

Output tokens are the tokens generated by a model in response to a prompt. Each model has different limits for output tokens. For example, gpt-4o-2024-08-06 can generate a maximum of 16,384 output tokens.

A context window describes the total tokens that can be used for both input and output tokens (and for some models, reasoning tokens). Compare the context window limits of our models. For example, gpt-4o-2024-08-06 has a total context window of 128k tokens.

If you create a very large prompt (usually by including a lot of conversation context or additional data/examples for the model), you run the risk of exceeding the allocated context window for a model, which might result in truncated outputs.

Use the tokenizer tool, built with the tiktoken library, to see how many tokens are in a particular string of text.

Optimizing model outputs

As you iterate on your prompts, you'll continually aim to improve accuracy, cost, and latency. Below, find techniques that optimize for each goal.

|| |Accuracy|Ensure the model produces accurate and useful responses to your prompts.|Accurate responses require that the model has all the information it needs to generate a response, and knows how to go about creating a response (from interpreting input to formatting and styling). Often, this will require a mix of prompt engineering, RAG, and model fine-tuning.Learn more about optimizing for accuracy.| |Cost|Drive down total cost of using models by reducing token usage and using cheaper models when possible.|To control costs, you can try to use fewer tokens or smaller, cheaper models. Learn more about optimizing for cost.| |Latency|Decrease the time it takes to generate responses to your prompts.|Optimizing for low latency is a multifaceted process including prompt engineering and parallelism in your own code. Learn more about optimizing for latency.|

Vision

Learn how to use vision capabilities to understand images.

Many OpenAI models have vision capabilities, meaning the models can take images as input and answer questions about them. Historically, language model systems were limited to a single input modality, text.

Quickstart

Images are made available to the model in two main ways: by passing a link to the image or by passing the Base64 encoded image directly in the request. Images can be passed in the user messages.

Analyze the content of an image

The model is best at answering general questions about what is present in the images. Although it understands the relationship between objects in images, it's not yet optimized to answer detailed questions about objects' locations in an image. For example, you can ask it what color a car is, or for some dinner ideas based on what's in your fridge, but if you show the model an image of a room and ask where the chair is, it may not answer the question correctly.

Keep model limitations in mind as you explore use cases for visual understanding.

[

Video understanding with vision

Learn how to use use GPT-4 with Vision to understand videos in the OpenAI Cookbook

](https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding)

Uploading Base64 encoded images

If you have an image or set of images locally, pass them to the model in Base64 encoded format:

Multiple image inputs

The Chat Completions API is capable of taking in and processing multiple image inputs, in Base64 encoded format or as an image URL. The model processes each image and uses information from all images to answer the question.

Multiple image inputs

Here, the model is shown two copies of the same image. It can answer questions about both images or each image independently.

Low or high fidelity image understanding

The detail parameter—which has three options, low, high, and auto—gives you control over how the model processes the image and generates its textual understanding. By default, the model will use the auto setting, which looks at the image input size and decides if it should use the low or high setting.

low enables the "low res" mode. The model receives a low-resolution 512px x 512px version of the image. It represents the image with a budget of 85 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.

high enables "high res" mode, which first lets the model see the low-resolution image (using 85 tokens) and then creates detailed crops using 170 tokens for each 512px x 512px tile.

Choosing the detail level

Managing images

Unlike the Assistants API, the Chat Completions API isn't stateful. That means you have to manage messages (including images) you pass to the model. To pass the same image to the model multiple times, you have to pass the image each time you make a request to the API.

For long-running conversations, we suggest passing images via URLs instead of Base64. The latency of the model can also be improved by downsizing your images ahead of time to less than the maximum size.

Image size guidelines

We restrict image uploads to 20MB per image. Here are our image size expectations.

|| |Low-res|512px x 512px| |High res|Short side: less than 768pxLong side: less than 2,000px|

After an image has been processed by the model, it's deleted from OpenAI servers and not retained. We do not use data uploaded via the OpenAI API to train our models.

Limitations

While GPT-4 with vision is powerful and can be used in many situations, it's important to understand the limitations of the model. Here are some known limitations:

Medical images: The model is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

Small text: Enlarge text within the image to improve readability, but avoid cropping important details.

Rotation: The model may misinterpret rotated or upside-down text and images.

Visual elements: The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.

Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

Accuracy: The model may generate incorrect descriptions or captions in certain scenarios.

Image shape: The model struggles with panoramic and fisheye images.

Metadata and resizing: The model doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.

Counting: The model may give approximate counts for objects in images.

CAPTCHAS: For safety reasons, our system blocks the submission of CAPTCHAs.

Calculating costs

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of an image is determined by two factors: size and detail.

Low res cost

Any image with detail: low costs 85 tokens.

High res cost

To calculate the cost of an image with detail: high, we do the following:

Scale to fit within a 2048px x 2048px square, maintaining original aspect ratio

Scale so that the image's shortest side is 768px long

Count the number of 512px squares in the image—each square costs 170 tokens

Add 85 tokens to the total

Cost calculation examples

A 1024 x 1024 square image in detail: high mode costs 765 tokens

1024 is less than 2048, so there is no initial resize.
The shortest side is 1024, so we scale the image down to 768 x 768.
4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.

A 2048 x 4096 image in detail: high mode costs 1105 tokens

We scale down the image to 1024 x 2048 to fit within the 2048 square.
The shortest side is 1024, so we further scale down to 768 x 1536.
6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.

A 4096 x 8192 image in detail: low most costs 85 tokens

Regardless of input size, low detail images are a fixed cost.

FAQ

Can I fine-tune the image capabilities in `gpt-4`?

Vision fine-tuning is available for some models. Learn more.

Can I use `gpt-4` to generate images?

No, you can use dall-e-3 to generate images and gpt-4o, gpt-4o-mini, or gpt-4-turbo to understand images.

What type of files can I upload?

We support PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).

Is there a limit to the size of the image I can upload?

Yes, we restrict image uploads to 20MB per image.

Can I delete an image I uploaded?

No, we will delete the image for you automatically after it has been processed by the model.

Where can I learn more about the considerations of GPT-4 with Vision?

You can find details about our evaluations, preparation, and mitigation work in the GPT-4 with Vision system card.

We have further implemented a system to block the submission of CAPTCHAs.

How do rate limits for GPT-4 with Vision work?

We process images at the token level, so each image we process counts towards your tokens per minute (TPM) limit. See the calculating costs section for details on the formula used to determine token count per image.

Can GPT-4 with Vision understand image metadata?

No, the model does not receive image metadata.

What happens if my image is unclear?

If an image is ambiguous or unclear, the model will do its best to interpret it. However, the results may be less accurate. A good rule of thumb is that if an average human cannot see the info in an image at the resolutions used in low/high res mode, the model cannot either.

Models

Flagship models

Models overview

Context window

Model ID aliases and snapshots

Current model aliases

GPT-4o

GPT-4o mini

GPT-4 Turbo and GPT-4

GPT-3.5 Turbo

DALL·E

TTS

Whisper

Embeddings

Moderation

GPT base

How we use your data

Default usage policies by endpoint

Model endpoint compatibility

Text generation

Quickstart

Choosing a model

Building prompts

Messages and roles

Giving the model additional data to use for generation

Conversations and context

Managing context for text generation

Optimizing model outputs

Vision

Quickstart

Uploading Base64 encoded images

Multiple image inputs

Low or high fidelity image understanding

Managing images

Image size guidelines

Limitations

Calculating costs

Low res cost

High res cost

Cost calculation examples

FAQ

Can I fine-tune the image capabilities in gpt-4?

Can I use gpt-4 to generate images?

What type of files can I upload?

Is there a limit to the size of the image I can upload?

Can I delete an image I uploaded?

Where can I learn more about the considerations of GPT-4 with Vision?

How do rate limits for GPT-4 with Vision work?

Can GPT-4 with Vision understand image metadata?

What happens if my image is unclear?

Can I fine-tune the image capabilities in `gpt-4`?

Can I use `gpt-4` to generate images?