What does it take to create a ChatGPT voice assistant?

The way ChatGPT blew up; even OpenAI’s president Greg Brockman and executives say they hadn’t expected that much. Just one week after being launched on November 30th last year, the super-intelligent chatbot crossed 1 Million users. This shows just how badly people needed a smart AI-powered assistant to talk with. Now, whenever digital assistants have come forward, voice features are the ones that follow. Take Google’s assistant, for example. In fact, even blogs are now commonly using TTS(Text-To-Speech) AI APIs to read out articles. And when it comes to AI the level of ChatGPT, the expectations after it enables voice features are very high. Like… from Rowan Cheung‘s recent tweet calling ChatGPT a free money printer to hbr’s review calling it AI’s tippling point. Not only does it have to be conversational, but it also has to sound natural and human-like. Of course, only that will do justice to the generative abilities ChatGPT possesses.

How ChatGPT Works

ChatGPT is a member of the GPT family of language models developed by OpenAI. Other GPT models, including the latest davinci-003, focus on language generation tasks. ChatGPT has more conversational training data. Just like any other GPT model, ChatGPT is transformer-based. It works by predicting the next word in a sentence based on the input text, using deep neural networks and a self-attention mechanism. The model has 175 billion parameters and was trained on over 570 GB of text data from various sources. Apart from common Crawl, sources include web pages, books, and Wikipedia articles. The training took over 3 months on then-high-performance GPUs (it took place in 2021). The model’s ability to generate coherent and diverse text, answer questions, summarize text, translate languages, and perform other language tasks makes it a powerful tool for natural language processing applications.

Why is there no voice version of ChatGPT yet?

It looks so easy on the surface — just combine a text-to-speech (TTS) model with a GPT model, right? Well, it’s not impossible by any means. But still, it’s not as simple as it looks. Looking from OpenAI’s perspective, adding TTS to the ChatGPT model would add an extra layer of complexity. From additional resources like GPUs to storage, developers need to figure out how to make the model work efficiently. Integrating a TTS model with GPT would also require a lot of additional budgets, training time, and resources. High-quality audio, accurate speech recognition, again, are must to maintain ChatGPT’s reputation. For that, a partnership with a good TTS provider would be necessary, which can be costly and time-consuming, especially now, as ChatGPT is available for free. (OpenAI itself has stated that ChatGPT is in its feedback stage.)

When will we see ChatGPT voice assistant?

It’s impossible to predict the exact day and time of ChatGPT voice assistant’s launch. However, we can take the available information and speculate.

a. Budget Problem

Sources state that OpenAI executives are discussing a $42 monthly subscription fee for ChatGPT. If that happens, then the company will probably be able to invest in TTS. After looking at ChatGPT, Microsoft has already confirmed its $10 Billion investment in OpenAI; it’s a huge step forward. Remember how Microsoft invested $240 Million in Facebook back in 2007? They know how to invest in the right tech and turn them “giants”.

b. Training a Model

Once the budget is in place, the next step would be training a TTS model. OpenAI will need to train a model that can generate convincing and accurate audio from text. It will also need to be powerful enough to handle the conversational abilities of ChatGPT. ChatGPT servers are already famous for crashing due to heavy load. TTS models can add an extra load, so OpenAI will need to be extra careful about this.

c. Version Management

We are yet to see whether the voice feature will cost extra or be included in the existing ChatGPT subscription. In either case, maintaining two different versions of the product — text-only and voice-enabled — will require extra effort.

d. Artificial Voice

OpenAI already has its “whisper“, an ASR (Automatic Speech Recognition) system. However, they may need to tweak the system to match the naturalness and accuracy of human voices. As mentioned earlier, partnering with a good TTS provider is the likely way for them to go.

We can estimate the ChatGPT voice assistant to arrive sometime in Q3-Q4, 2023.

Voice control browser extension for ChatGPT

The ChatGPT Voice Extension is a hidden gem that many are not aware of. This amazing tool allows you to interact with the ChatGPT preview from OpenAI using just your voice. The option to record your voice and have responses read aloud makes the conversation feel more natural and immersive. It also offers press-and-hold shortcuts such as holding down the SPACE key outside the text input to record, releasing to submit, and pressing ESC or Q to cancel the transcription. Additionally, you can press E to stop and copy a transcription to the ChatGPT input. The extension supports multiple languages, making it accessible to a wide range of users. The extension is easy to use and supports multiple languages. Despite its usefulness, this extension is not widely known and is definitely worth discovering for anyone looking to enhance their ChatGPT experience.

Please note that this article is not sponsored by or affiliated with “OpenAI” or “Voice control browser extension for ChatGPT” by any means. It is the user’s responsibility to ensure that they are comfortable with the level of privacy and security provided by the extension, and to make informed decisions about what information they share online.

Try Now

Conclusion

As with any other AI-based technology, OpenAI’s ChatGPT also requires lots of resources, training, and budget to incorporate a voice feature. Due to its enormous capabilities, people often tend to forget that ChatGPT is still in its early phases. There’s a lot to come if we look at the improvements each new generation of GPT models have brought.