Apr 03, 2026 12:20:00

Microsoft has released three AI platform models: the voice generation model 'MAI-Voice-1,' the voice recognition model 'MAI-Transcribe-1,' and the image generation model 'MAI-Image-2.'

Microsoft has announced three new AI platform models developed in-house: 'MAI-Voice-1' for speech generation, 'MAI-Transcribe-1' for speech recognition, and 'MAI-Image-2' for image generation.

Today we're announcing 3 new world class MAI models, available in Foundry | Microsoft AI
https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry | Microsoft Community Hub
https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/introducing-mai-transcribe-1-mai-voice-1-and-mai-image-2-in-microsoft-foundry/4507787

We're bringing our growing MAI model family to every developer in Foundry, including …

· MAI-Transcribe-1, most accurate transcription model in world across 25 languages
· MAI-Voice-1, natural, expressive speech generation
· MAI-Image-2, our most capable image model yet

Start… pic.twitter.com/p0DZZcAUZ4
— Satya Nadella (@satyanadella) April 2, 2026

Today, we announced the public preview of MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on Microsoft Foundry, bringing our first-party AI models directly into the hands of developers. Read more: https://t.co/MYsQLNP8LK pic.twitter.com/hkz7joqpLb
— Microsoft Azure (@Azure) April 2, 2026

The MAI family is a group of AI models developed in-house by Microsoft, which the company positions as faster and less expensive than competing models. For example, MAI-Transcribe-1 costs from $0.36 per hour of audio (approximately 57 yen), MAI-Voice-1 costs from $22 per million characters (approximately 3,500 yen), and MAI-Image-2 costs from $5 per million tokens (approximately 800 yen) for text input and from $33 per million tokens (approximately 5,270 yen) for image output.
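As a rough illustration of what these starting prices mean in practice, the arithmetic can be sketched in Python as follows. The rates are the "from" prices quoted above; the function names and the example workloads (100 hours of audio, 500,000 characters, and the token counts) are hypothetical, and actual billing may differ by tier or region.

```python
# Starting ("from") prices in USD as quoted in the article.
# Assumption: flat per-unit billing with no tiers, minimums, or discounts.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_TEXT_IN_USD_PER_MILLION_TOKENS = 5.0
IMAGE_OUT_USD_PER_MILLION_TOKENS = 33.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text input and image output tokens."""
    text_part = text_input_tokens / 1_000_000 * IMAGE_TEXT_IN_USD_PER_MILLION_TOKENS
    image_part = image_output_tokens / 1_000_000 * IMAGE_OUT_USD_PER_MILLION_TOKENS
    return text_part + image_part


# Hypothetical workloads, chosen only to show the arithmetic.
print(f"100 hours of transcription: ${transcribe_cost(100):.2f}")
print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
print(f"1,000 text tokens + 4,000 image tokens: ${image_cost(1_000, 4_000):.3f}")
```

Under these assumptions, 100 hours of transcription comes to about $36 and 500,000 characters of generated speech to about $11.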

'MAI-Transcribe-1,' 'MAI-Voice-1,' and 'MAI-Image-2' are available through Microsoft Foundry and MAI Playground, although MAI Playground is only accessible from the United States at the time of writing.

◆MAI-Transcribe-1
The MAI-Transcribe-1 speech recognition model was evaluated on FLEURS, an industry-standard speech benchmark, to measure how accurately it transcribes speech into text in the 25 most widely used languages worldwide (including Japanese).

The graph below compares the word error rate (WER) of competing models, with MAI-Transcribe-1 recording the lowest at 3.9%.
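For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal Python sketch of that calculation, not the scoring code used for the benchmark. The function name is hypothetical, and splitting on whitespace is an assumption that does not suit languages written without spaces, such as Japanese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

By this definition, a WER of 3.9% corresponds to roughly 4 word errors per 100 reference words.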

Furthermore, MAI-Transcribe-1 not only delivers excellent accuracy but also runs very fast. In addition, MAI-Transcribe-1 is now available on Microsoft Foundry, where Microsoft says it achieves the best price-performance ratio among major cloud providers.

◆MAI-Voice-1
MAI-Voice-1 is a top-of-the-line AI speech generation model developed by Microsoft. It is designed to generate natural, realistic speech, and excels at conveying nuance, emotional range, and rich expressiveness while preserving the speaker's individuality, even in long-form content.

MAI-Voice-1 is now available on Microsoft Foundry, allowing you to securely and reliably create your own custom voice from just a few seconds of audio data. Microsoft says MAI-Voice-1 'fundamentally transforms how developers build high-quality, fast voice experiences and voice agents.'

MAI-Voice-1 can generate 60 seconds of audio in just one second, and its highly efficient GPU utilization results in an excellent balance of quality and cost. MAI-Voice-1 is also available for use with Copilot Audio Expressions.

Copilot Audio Expressions - Experiments by Copilot Labs
https://copilot.microsoft.com/labs/audio-expression

◆MAI-Image-2
MAI-Image-2 is an image generation model that ranks in the top 3 on the Arena.ai leaderboard, a benchmark that compares the image generation performance of AI models. It was announced on March 19 and has already contributed to improving Copilot's image generation. Based on actual production traffic data, it generates images at least twice as fast while maintaining equivalent quality on Microsoft Foundry and Copilot.

MAI-Image-2 can generate natural lighting, accurate skin tones and textures, charts, layouts, and crisp in-image text. Furthermore, MAI-Image-2 is offered at a competitive price-performance ratio. WPP Group, one of the world's largest advertising agencies, is already using MAI-Image-2 on a large scale as an enterprise partner.

The following is an example of an image created by WPP Group using MAI-Image-2.

Mustafa Suleyman, CEO of Microsoft AI, also showcases the types of images that can be generated with MAI-Image-2. Below is an image generated with the prompt: 'A close-up, zoom-in macro photograph of a vibrant orange clownfish hiding among a pure white peony with bright yellow stamens. High contrast, shallow depth of field, vivid wildlife photography.'

One place MAI-Image-2 really knocks it out of the park is surrealist images. Try this one:

Close-up zoomed in macro photo of a bright orange clownfish hiding among stark white peonies with bright yellow stamens. High contrast, shallow depth of field, vibrant wildlife… https://t.co/mg7gRY26ay pic.twitter.com/0oUKJKvzVg
— Mustafa Suleyman (@mustafasuleyman) April 1, 2026

In addition, Suleyman has spoken about the newly released MAI family in interviews with VentureBeat and The Verge.

Apr 03, 2026 12:20:00 in AI, Posted by logu_ii