OpenAI is embarking on a significant expansion of ChatGPT’s capabilities by introducing voice and image functionality.
This development represents a major leap in making your interactions with ChatGPT more intuitive and versatile. It allows you to engage in voice conversations with the AI or share images to enhance communication.
The integration of voice and image features opens up a myriad of possibilities for ChatGPT users. For example, while travelling, you can capture a photo of a landmark and engage in a live conversation with ChatGPT to discover interesting facts about it.
When you’re back home, snap photos of your fridge and pantry to help plan your dinner menu, and ask follow-up questions for step-by-step recipes.
These enhancements are set to roll out to Plus and Enterprise users over the next two weeks. The voice feature will be available on both iOS and Android, and users can opt in through their settings. Images will be accessible on all platforms.
Voice interaction
Users can engage in fluid, back-and-forth voice conversations with ChatGPT, whether they are on the move, settling a dinner-table debate, or asking for a bedtime story.
To get started with voice, navigate to Settings in the mobile app, choose New Features, and opt into voice conversations. Then, tap the headphone icon in the top-right corner of the home screen and select your preferred voice from five options.
For the voice feature, OpenAI collaborated with professional voice actors to craft each of the five voices and uses its open-source speech recognition system, Whisper, to transcribe spoken words into text.
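Because Whisper is released as open-source software, the transcription step can be reproduced outside the ChatGPT app. The snippet below is a minimal sketch assuming the openai-whisper Python package is installed and a local audio file is available (the file name is purely illustrative):

```python
# Minimal sketch: speech-to-text with the open-source Whisper package.
# Assumes `pip install openai-whisper` and a local recording named "question.m4a".
import whisper

model = whisper.load_model("base")          # small multilingual model; larger checkpoints are more accurate
result = model.transcribe("question.m4a")   # returns a dict with the detected language and the transcript
print(result["text"])                       # text that a chat model could then respond to
```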
Image interaction
Users can present ChatGPT with one or more images, opening up endless possibilities. Troubleshoot issues like why your grill won’t start or analyze complex graphs for work-related data.
To initiate image-based interactions, tap the photo button to capture or select an image; on iOS and Android, tap the plus button first. You can also discuss multiple images or use the drawing tool to guide your assistant.
ChatGPT’s image understanding is powered by multimodal GPT-3.5 and GPT-4 models, which apply their language reasoning skills to a wide array of images, including photographs, screenshots, and documents containing both text and images.
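The ChatGPT app handles this flow automatically; for developers, a comparable image-plus-text exchange can be sketched against a vision-capable model through the OpenAI API. The example below is illustrative only (the model name and file path are placeholders, not part of this announcement):

```python
# Illustrative sketch: sending an image alongside a text prompt to a
# vision-capable chat model via the OpenAI Chat Completions API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo as a base64 data URL (file name is a placeholder)
with open("grill.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Why might this grill fail to start?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```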
New risks
While the new voice technology can create realistic synthetic voices, it also poses new challenges and risks, such as the potential for malicious actors to impersonate public figures or commit fraud.
To address this, OpenAI is initially deploying this technology for a specific use case—voice chat. The voices used have been carefully crafted with voice actors directly engaged by OpenAI. Collaborations with other entities, like Spotify for Voice Translation, exemplify how this technology can enhance accessibility and creativity.
Vision-based models, on the other hand, bring unique challenges, including hallucinations and reliance on the model’s interpretation of images in high-stakes domains. Prior to broader deployment, OpenAI conducted rigorous testing with red teamers, who assessed risks in domains such as extremism and scientific proficiency, and with a diverse set of alpha testers. This research enabled the company to establish key guidelines for responsible usage.
Transparency and limitations
OpenAI is committed to transparency regarding the limitations of ChatGPT. While the model excels at transcribing English speech, its performance may be suboptimal with some other languages, particularly those written in non-Roman scripts. Therefore, non-English users are advised against relying on ChatGPT to transcribe these languages.
News Source: Gulf Business