ChatGPT can finally 'see,' 'hear,' and 'speak,' letting you hold voice conversations and ask questions about photos as you would with a person.



OpenAI has announced that it will add ``a function to recognize and respond to the content of images'' and ``a function to communicate by voice instead of text'' to ChatGPT. This makes visual exchanges possible, such as ``showing ChatGPT the contents of your refrigerator and having it suggest a recipe,'' and also lets you talk with ChatGPT aloud, much as you would in a conversation with a person.

ChatGPT can now see, hear, and speak

https://openai.com/blog/chatgpt-can-now-see-hear-and-speak

According to OpenAI, image recognition and voice communication will roll out over the next two weeks to subscribers of the paid plans ``ChatGPT Plus'' and ``ChatGPT Enterprise.'' Of the two, image recognition will be available on all platforms, while voice communication will be available only in the iOS and Android versions of ChatGPT.

◆Image recognition function
With the image recognition function, when you give ChatGPT an image, it is processed by GPT-3.5 or GPT-4 and ChatGPT returns a response based on what the image shows. For example, you can ``show it the contents of your refrigerator and ask it to suggest a recipe'' or ``show it a graph and ask it to explain the main points.''

In the example below, we show ChatGPT a photo of a bicycle and ask, ``How do I lower the saddle of my bicycle?'' ChatGPT explains that the saddle can be lowered by ``operating the quick release lever or bolt.''



Next, when we circled part of the photo with a white line and asked, ``Is this a lever?'', ChatGPT answered, ``No, it's a bolt. You'll need a hex wrench to loosen it.''



We then showed ChatGPT pictures of the bicycle's manual and a toolbox and asked, ``The manual and toolbox look like this, but are there any tools that match?'' ChatGPT answered that the correct tool is in the set labeled 'DEWALT' on the left side of the toolbox.



◆Voice communication function
The voice communication screen looks like this. The user's speech is transcribed by the speech recognition AI 'Whisper', and ChatGPT replies to it by voice.



ChatGPT's voice is generated by an ``AI model that can create synthetic speech using only text and a few seconds of voice samples,'' and at the time of writing, five voices are available. If you play the video below and listen to the ChatGPT voice sample, you can hear that the output sounds quite natural.

Sample of ChatGPT voice conversation function - YouTube


The AI model used to generate ChatGPT's synthesized speech is also used in the automatic translation feature that Spotify is currently testing.

in Software, Posted by log1o_hf