Here at DigitalOcean, we are very excited about voice models. Can you blame us? Open-source speech recognition and TTS models are getting so good that we are optimistic about their adoption in just about any application where voice technology makes sense. From enhancing accessibility to improving user interfaces across devices (e.g. smartphones, smart glasses, robots, and voice-operated televisions), we are excited for improved user experiences.
Take agentic computer use, for example: we're bottlenecked by our ability to type and click. Sometimes our minds run faster than our ability to convey our thinking, so voice may well prove to be a more expeditious medium for articulating user intent. That said, an office with ten people shouting instructions at their laptops is far from ideal, but having voice operation as an option can certainly be beneficial.
When it comes to implementation, current spoken-dialogue systems typically depend on pipelines of independently functioning components: voice activity detection, speech recognition, textual dialogue (often from an LLM), and text-to-speech.
Delays build up across the different parts of these systems, pushing total response time to several seconds. This is much slower than natural conversation, where response times are in the hundreds of milliseconds. While considerable progress has been made, interrupting current voice AI systems mid-response still feels unnatural and awkward. Additionally, because many of these voice pipelines only understand and generate text, they cannot make use of information that isn't written down, such as the speaker's tone or emotion.
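To make the latency problem concrete, here is a purely illustrative Python sketch of such a cascade. The stage functions and their timings are hypothetical stand-ins, not real models; the point is simply that serial stages add up:

import time

# Hypothetical stages; in a real system each would wrap a model call
def detect_voice_activity(audio):
    time.sleep(0.05)  # ~50 ms for voice activity detection
    return audio

def transcribe(audio):
    time.sleep(0.4)   # ~400 ms for speech recognition
    return "user utterance"

def generate_reply(text):
    time.sleep(1.5)   # ~1.5 s for an LLM response
    return "assistant reply"

def synthesize(text):
    time.sleep(0.5)   # ~500 ms for text-to-speech
    return b"audio bytes"

start = time.time()
audio_out = synthesize(generate_reply(transcribe(detect_voice_activity(b"mic audio"))))
print(f"End-to-end latency: {time.time() - start:.2f}s")  # well over two seconds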
For those interested, the paper Moshi: a speech-text foundation model for real-time dialogue does a good job of illustrating the limitations of voice AI in its introduction. On another note, the Conversational Speech Model (CSM) from Sesame (so, so cool), which we covered in the past, borrows this paper's advanced tokenizer for discretizing high-fidelity audio information.
Anyways, as the focus of this article is a TTS model called Chatterbox, let's turn our attention to Text-to-Speech, shall we?
Text-to-Speech (TTS) models, as their name suggests, convert text into speech. We have all heard that personalization is one of the biggest leverage points of AI. When it comes to voice AI, TTS models with voice cloning capabilities let you tailor voices to desired languages, accents, and emotional tones, so that interactions feel more personal and engaging. An excellent application is audiobooks, where an entire book can be generated in the author's voice. We'll show you how to approach this in the implementation section of this article.
Thanks to TTS models, information isn't just something you read; it's something you can absorb while you're cooking, driving, or waiting in line. If you haven't tried NotebookLM already, we encourage you to do so - it's incredible. Among its many features, NotebookLM generates a podcast with natural-sounding voices, creating digestible and engaging audio from your uploaded documents and links.
Our AI content team has been looking closely at TTS models, such as Nari Labs' Dia. Interestingly, the TTS models we've been exploring don't have research papers, which makes sense given the small teams accomplishing these amazing feats. Nari Labs, which released Dia, had only two people working on the model, and Chatterbox, which we are about to cover, comes from a three-person team. We're very excited about the progress made by these small but mighty teams.
Resemble AI recently launched their first open-source TTS model under an MIT license. The model has been trending on Hugging Face since its release. What's unique about it is a feature they call emotion exaggeration control. Feel free to play around with this adjustable exaggeration parameter in their demo.
Resemble AI acknowledges CosyVoice, HiFT-GAN, and Llama 3 (now deprecated) as inspirations. Audio files generated by Chatterbox incorporate the Perth (Perceptual Threshold) watermarker, allowing AI-generated content to be detected.
The voice cloning ability of Chatterbox is very impressive. In our testing, the cloned voices bore a remarkable similarity to our own. For those interested in comparisons to ElevenLabs, A/B testing is available on Podonos.
This article will cover two implementation options for using the Chatterbox TTS model: option 1 runs the Gradio demo app on a DigitalOcean GPU Droplet, and option 2 scripts the model directly in Python to generate an audiobook with a cloned voice. Let's start with option 1.
Begin by setting up a DigitalOcean GPU Droplet, selecting AI/ML and choosing the NVIDIA H100 option.
Once your GPU Droplet finishes loading, you'll be able to open up the Web Console.
Next, install the necessary software packages. In the web console, paste and run the following commands to install pip for managing Python packages and git-lfs for handling large files:
apt update
apt install python3-pip python3.10 git-lfs -y
Now, download the application code from Hugging Face and prepare its environment.
git-lfs clone https://huggingface.co/spaces/ResembleAI/Chatterbox
cd Chatterbox
python3 -m venv venv_chatterbox
source venv_chatterbox/bin/activate
pip3 install -r requirements.txt
pip3 install spaces
To make your Gradio app accessible over the internet, you need to make a small change to its source code.
Open the main application file in the Vim text editor:
vim app.py
Press the i key to enter INSERT mode. You’ll see -- INSERT -- at the bottom of the terminal. Then, locate the last line of the file, which likely looks something like demo.launch(). Modify it to include share=True:
demo.launch(share=True)
Press the ESC key to exit INSERT mode. Afterwards, type :wq and press Enter to save your changes and exit Vim.
You’re all set! Run the application with the following command:
python3 app.py
After the script initializes, you will see a public URL in the terminal output. Open this URL in your web browser to interact with your live Gradio application.
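If you'd prefer to call the running app from code rather than the browser, the gradio_client package (pip3 install gradio_client) can connect to that same public URL. The URL below is a placeholder, and since we aren't assuming the endpoint names this particular app exposes, the sketch simply lists them with view_api():

from gradio_client import Client

# Placeholder: use the public URL printed in your terminal
client = Client("https://your-app-id.gradio.live")

# Print the available endpoints and their parameters before calling client.predict(...)
client.view_api()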
To create an audiobook, Chatterbox requires a short audio sample of the author’s voice to clone it effectively.
For optimal results, the Resemble AI team recommends the following for the reference recording:

- At least 10 seconds in duration, ideally in WAV format.
- A sample rate of 24 kHz or higher.
- A single speaker with no background noise, recorded on a professional microphone if possible.
- Content and speaking style that match the target: the emotion of the spoken sentence should match the emotion in the audio file, and the reference clip's speaking style should resemble the desired output (e.g. an audiobook-style clip for audiobook generation).
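If a recording doesn't quite meet these specs, here is a minimal preparation sketch using torchaudio (installed as part of the next step); the file names are placeholders, and the checks simply mirror the recommendations above:

import torchaudio as ta

# Placeholder paths: your raw recording in, a cleaned-up reference clip out
raw_path = "raw_author_recording.wav"
out_path = "author_sample.wav"

waveform, sr = ta.load(raw_path)  # waveform shape: (channels, samples)

# Warn if the clip is shorter than the recommended 10 seconds
duration = waveform.shape[1] / sr
if duration < 10:
    print(f"Warning: clip is {duration:.1f}s; at least 10s is recommended")

# Mix down to mono, since a single clean speaker works best
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 24 kHz if needed (upsampling won't add fidelity, but matches the expected rate)
target_sr = 24_000
if sr != target_sr:
    waveform = ta.functional.resample(waveform, orig_freq=sr, new_freq=target_sr)
    sr = target_sr

ta.save(out_path, waveform, sr)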
See option 1 earlier in this tutorial for instructions on setting up a GPU Droplet, cloning the Chatterbox repo, and setting up a virtual environment. Paste the code snippet below into the terminal to install the necessary packages.
pip3 install chatterbox-tts torchaudio
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Load the pre-trained Chatterbox model
model = ChatterboxTTS.from_pretrained(device="cuda") # Use "cpu" if CUDA is unavailable
# Define the text to be converted into speech
text = "Your audiobook text goes here."
# Specify the path to the reference audio sample
audio_prompt_path = "author_sample.wav"
# Generate the speech waveform
wav = model.generate(text, audio_prompt_path=audio_prompt_path)
# Save the generated audio to a file
ta.save("audiobook_segment.wav", wav, model.sr)
Replace “Your audiobook text goes here.” with the actual text from your audiobook, and author_sample.wav with the path to your reference audio file.
You can adjust the expressiveness and pacing of the synthesized voice using the exaggeration and cfg_weight parameters:

- exaggeration: controls emotional expressiveness; higher values make the speech more dramatic.
- cfg_weight (classifier-free guidance): adjusts adherence to the reference voice's characteristics; lower values can slow the speech down for clarity.
wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    exaggeration=0.7,  # More expressive
    cfg_weight=0.3,    # Slower, more deliberate pacing
)
Process each chapter or section of your audiobook individually, generating the corresponding audio files (see the sketch below for one way to automate this). Once all segments are synthesized, use an audio editing tool like Audacity to combine the segments in order, trim unwanted silences, and export the final audiobook in your preferred format.
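For longer books, you can also script the per-chapter loop. Below is a rough sketch with a hypothetical chapter list; it reuses the model and reference clip from above, saves one file per chapter for editing, and optionally stitches everything together with torch.cat as a code-only alternative to manual merging:

import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")  # Use "cpu" if CUDA is unavailable
audio_prompt_path = "author_sample.wav"  # reference clip from earlier

# Hypothetical chapter texts; in practice, read these from your manuscript files
chapters = [
    "Chapter one text goes here.",
    "Chapter two text goes here.",
]

segments = []
for i, chapter_text in enumerate(chapters, start=1):
    wav = model.generate(chapter_text, audio_prompt_path=audio_prompt_path)
    ta.save(f"chapter_{i:02d}.wav", wav, model.sr)  # keep per-chapter files for editing
    segments.append(wav)

# Optional: concatenate all segments along the time axis into one file
full_book = torch.cat(segments, dim=-1)
ta.save("audiobook_full.wav", full_book, model.sr)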
Chatterbox, developed by Resemble AI, is a recently released text-to-speech model with impressive voice cloning abilities and natural-sounding voices. The model can be run as a Gradio app and incorporated into a variety of use cases (e.g. audiobooks). Chatterbox represents the significant progress being made in personalized voice AI.
Deepgram, an enterprise voice AI platform, published the report “State of Voice AI 2025,” which highlights trends in voice AI adoption and makes the case that 2025 is the year of the voice AI agent.
Check out one of our older tutorials which leverages Deepgram: “Building a Real-time AI Chatbot with Vision and Voice Capabilities using OpenAI, LiveKit, and Deepgram on GPU Droplets”