GPT-4o, a new model from OpenAI
Boonyawee Sirimaya
May 14, 2024

GPT-4o: Next-Gen AI for Enhanced Interaction

OpenAI has launched GPT-4o, its newest flagship model, which reasons across text, vision, and audio in real time.

What is GPT-4o?

GPT-4o is a groundbreaking AI model that seamlessly integrates and processes audio, visual, and textual information in real-time. The "o" in GPT-4o stands for "omni," highlighting its ability to accept and generate any combination of text, audio, and images. This cutting-edge technology marks a significant step towards more intuitive and natural human-computer interaction.

Lightning-Fast Response Times and Improved Language Processing

One of the most impressive features of GPT-4o is its swift response time to audio inputs, averaging just 320 milliseconds, which is comparable to human response time in conversation. The model matches the performance of GPT-4 Turbo in English text and code processing while significantly improving its handling of non-English languages. Moreover, GPT-4o is faster and 50% more cost-effective when accessed through the API. Compared to existing models, GPT-4o demonstrates superior vision and audio understanding capabilities.
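As an illustration of the API access mentioned above, a minimal GPT-4o request might look like the following sketch. It assumes the official `openai` Python SDK; an actual call requires an `OPENAI_API_KEY`, so the call itself is shown in comments.

```python
# Minimal sketch of a GPT-4o request via the OpenAI Chat Completions API.
# Assumes the official `openai` Python SDK; a real call needs OPENAI_API_KEY set.
request = {
    "model": "gpt-4o",  # the new flagship model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize GPT-4o in one sentence."},
    ],
}

# With a configured client, the call would be:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**request)
#   print(response.choices[0].message.content)
```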

Overcoming the Limitations of Voice Mode

Before the introduction of GPT-4o, users could interact with ChatGPT using Voice Mode, but the experience was hindered by latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. Voice Mode relied on a pipeline of three separate models: one for transcribing audio to text, another (GPT-3.5 or GPT-4) for processing the text input and generating a text output, and a third for converting the output text back to audio. 

This fragmented process limited GPT-4's ability to capture nuances like tone, multiple speakers, and background noises, and prevented it from generating outputs like laughter, singing, or emotional expressions.
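The old three-stage pipeline can be sketched roughly as follows. The function bodies are hypothetical stand-ins for the transcription, language, and speech-synthesis models, not real OpenAI APIs; the point is that everything funnels through plain text, which is where tone and speaker information get lost.

```python
# Rough sketch of the pre-GPT-4o Voice Mode pipeline: three separate models,
# so prosody, speaker identity, and background audio are lost at the text step.
# All function bodies are hypothetical placeholders, not real OpenAI APIs.

def transcribe_audio(audio: bytes) -> str:
    """Stage 1: speech-to-text (e.g. a Whisper-style model)."""
    return "hello, how are you?"  # placeholder transcript

def generate_reply(text: str) -> str:
    """Stage 2: GPT-3.5 / GPT-4 operates on text only."""
    return f"You said: {text}"

def synthesize_speech(text: str) -> bytes:
    """Stage 3: text-to-speech; tone and emotion must be reconstructed here."""
    return text.encode("utf-8")  # placeholder audio

def voice_mode(audio: bytes) -> bytes:
    # Each hand-off adds latency and discards non-textual information.
    transcript = transcribe_audio(audio)
    reply_text = generate_reply(transcript)
    return synthesize_speech(reply_text)
```

GPT-4o replaces this hand-off chain with a single model trained end-to-end across text, vision, and audio, which is what makes the sub-second response times possible.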

Example of GPT-4o Capabilities (OpenAI, 2024)


GPT-4o demonstrates remarkable performance across various benchmarks, rivaling GPT-4 Turbo in text, reasoning, and coding intelligence while setting new standards in multilingual, audio, and vision capabilities.

Text Evaluation of GPT-4o
The chart compares GPT-4o's text performance with that of competing models (OpenAI, 2024)

The model's enhanced reasoning abilities are exemplified by its impressive 88.7% score on the 0-shot CoT MMLU, which tests general knowledge. Additionally, GPT-4o achieves a new high score of 87.2% on the traditional 5-shot no-CoT MMLU. These evaluations were conducted using the new simple-evals library.

Safety Measures and Limitations

GPT-4o prioritizes safety by design, incorporating filtered training data and post-training refinement across all modalities. Evaluations conducted according to OpenAI's Preparedness Framework, and in line with its voluntary commitments, indicate no scores above Medium risk in cybersecurity, CBRN, persuasion, or model autonomy.

External red teaming with 70+ experts has identified risks associated with new modalities, leading to the development of safety interventions. Initially, only text and image inputs and text outputs are included in the public release, with audio outputs limited to preset voices adhering to safety policies.

Despite these advancements, extensive testing and iteration have shown that GPT-4o still exhibits limitations across all of its modalities.

Availability of GPT-4o

GPT-4o represents a significant advancement in deep learning, with a focus on practical usability. Extensive research and development efforts over the past two years have led to efficiency improvements at every level of the technology stack. As a result, a GPT-4 level model can now be made available to a much broader audience.

The capabilities of GPT-4o will be introduced in a phased manner, with extended red team access commencing today. The text and image capabilities of GPT-4o are being integrated into ChatGPT, accessible to both free tier and Plus users, with the latter benefiting from up to 5x higher message limits. In the coming weeks, a new version of Voice Mode powered by GPT-4o will be launched in alpha within ChatGPT Plus.

Developers can now access GPT-4o through the API as a text and vision model, enjoying 2x faster performance, half the price, and 5x higher rate limits compared to GPT-4 Turbo. Plans are in place to introduce GPT-4o's groundbreaking audio and video capabilities to a select group of trusted partners via the API in the coming weeks.
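For the text-and-vision access described above, a single user message can mix text and image parts. A minimal sketch of such a request body follows; the image URL is a placeholder, and sending it would require the `openai` SDK and an API key.

```python
# Sketch of a text + vision request body for GPT-4o (Chat Completions API).
# The image URL is a placeholder; a real call needs the `openai` SDK and a key.
vision_request = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}
# A configured client would send it with:
#   client.chat.completions.create(**vision_request)
```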

Consult our experts at Amity Solutions for more information on chatbots here.