Speech AI

Provide voice-based interfaces for your conversational AI applications.

What Is Speech AI?

Speech AI lets people converse with devices, machines, and computers to simplify and augment their lives. A subset of conversational AI, it includes automatic speech recognition (ASR) and text-to-speech (TTS) to convert voice into text and generate a human-like voice from written words—making powerful applications like virtual assistants, real-time transcriptions, and voice searches driven by large language models (LLMs) and retrieval-augmented generation (RAG) possible.

The Benefits of Using Speech AI

World-Class Accuracy

Deliver exceptional customer experiences with the best-in-class accuracy that speech AI model customization makes possible.

Multiple Language Support

Broaden your customer base by offering voice-based applications in the languages your customers speak.

Performance and Scalability

Serve more customers with low-latency, high-throughput applications that can instantly scale on any infrastructure: on premises, cloud, edge, or embedded.

Unique, Natural Voices

Give your customer service a boost by delivering fast and meaningful engagements with your brand's unique voice.

Free Ebook: Building Speech AI Applications

Learn how to build and deploy real-time speech AI pipelines for your conversational AI application.

Speech AI Day Sessions

Speech AI From Research to Production Fireside Chat

In this fireside chat, innovative leaders from Carnegie Mellon University, Hippocratic AI, Suno, and Wipro share insights on overcoming the challenges in deploying cutting-edge, multilingual speech technologies and emerging trends across industries.

Unveiling End-to-End Speech and Translation AI Magic

In this session, speakers from Motorola and Softserve discuss how to deliver the most accurate transcription, translation, and engaging voices for conversational AI experiences in a fast and scalable way.

Transform Your Business With Speech AI

Speakers from Deloitte, Kore.ai, and PolyAI share their insights, expertise, and success stories demonstrating speech AI's transformative power in action.

How Speech AI Is Being Used

Multi-Speaker Transcription

Transcribe Multiple Speakers at Once

Modern speech-to-text algorithms transcribe meetings, lectures, and social conversations in different languages while identifying speakers and labeling their contributions. With NVIDIA speech and translation AI technologies and SDKs, you can create accurate transcriptions for call center conversations and video conferencing meetings or automate clinical note-taking during physician-patient interactions for many different languages.
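As a rough illustration of how a transcript gets its speaker labels, the sketch below merges word-level timestamps from a recognizer with speaker segments from a diarizer. The `Word` and `SpeakerSegment` classes and the `label_words` and `to_turns` functions are hypothetical names chosen for this example and are not part of any NVIDIA SDK. A real pipeline would take both inputs from its own speech recognition and speaker diarization models.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One recognized word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerSegment:
    """A time span attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return how many seconds two time intervals share (0.0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Word], segments: List[SpeakerSegment]) -> List[Tuple[str, str]]:
    """Assign each word to the speaker segment it overlaps the most.

    Words that overlap no segment are labeled "unknown".
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for seg in segments:
            shared = overlap(word.start, word.end, seg.start, seg.end)
            if shared > best_overlap:
                best_speaker, best_overlap = seg.speaker, shared
        labeled.append((best_speaker, word.text))
    return labeled


def to_turns(labeled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Tuple[str, str]] = []
    for speaker, text in labeled:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns
```

Assigning by maximum overlap is only one simple policy. It keeps the example short, but it does not handle overlapping speech, where two people talk at the same time.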

Virtual Assistant Applications

Make Your Assistants Virtual and Super Intelligent

Multilingual virtual assistants communicate with users via a speech interface to assist with diverse tasks—from resolving customer issues in call centers, to turning on the TV as a smart home assistant, to navigating to the nearest gas station as an in-car intelligent assistant. Build super intelligent virtual assistants and chatbots based on LLMs and RAG, or leverage NVIDIA Avatar Cloud Engine (ACE) to integrate NVIDIA speech and translation AI into your avatar applications for engaging interactions in many languages.

NVIDIA Custom Voice

Brand Your Voice

With a recognizable brand voice, companies can create multilingual applications that build relationships with customers in their own language while supporting all customers, including those with speech and language deficits. With NVIDIA Custom Voice, part of NVIDIA speech and translation AI, you can easily create a unique, high-quality voice personality for your brand in the language of your choice in hours rather than weeks, with as little as 30 minutes of recorded speech data.

Develop Customizable Speech AI Interfaces

Shorten Training by Using Pretrained Models

Modern speech AI systems use deep neural network (DNN) models trained on massive datasets. Over time, the size of speech AI models has grown so much that training such models can take weeks of intensive compute time, even when using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet, on high-performance GPUs.

NVIDIA speech and translation AI offers pretrained, production-quality models in the NVIDIA NGC™ catalog that are trained on several public and proprietary datasets for hundreds of thousands of hours on NVIDIA DGX™ systems.

Figure 1: Highly accurate multilingual pretrained models.

Figure 2: End-to-end NVIDIA NeMo workflow.

Customize Models for Higher Accuracy

Many enterprises have to customize speech and translation AI models to achieve the desired multilingual accuracy for their specific conversational applications. However, customizing speech AI models from scratch usually requires large training datasets and AI expertise.

To speed up development and customize speech models extensively, you can use NVIDIA NeMo™ to build, customize, and deploy speech pipelines, covering automatic speech recognition (ASR) and text-to-speech (TTS), as well as natural language processing (NLP) pipelines. With NeMo, you can customize, extend, and compose existing prebuilt speech AI modules to create new models. Models optimized with NeMo can easily be exported and deployed in NVIDIA® Riva on premises or in the cloud as a speech service.

Achieve Natural Interactions by Developing Real-Time Skills

For speech AI skills, companies have traditionally had to choose between accuracy and real-time performance. Users can't ask a question and then wait several seconds for a response, and companies don't want their conversational AI applications to misinterpret speech or produce gibberish.

With NVIDIA Riva, companies can achieve world-class accuracy and run their speech and translation AI pipelines in real time, under a few milliseconds. Riva offers state-of-the-art pretrained models on NGC that can be fine-tuned with NVIDIA NeMo for world-class accuracy, along with skills optimized for real-time performance.
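One common way to check whether a speech pipeline keeps up with live audio is the real-time factor: processing time divided by audio duration, where a value below 1.0 means the pipeline runs faster than the audio plays. The sketch below is a minimal, library-independent way to measure it. The `timed` and `real_time_factor` helpers are hypothetical names for this example, and the function being timed stands in for whatever speech service call your application makes.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, a clip of 2.0 seconds that takes 0.5 seconds to process has a real-time factor of 0.25. For interactive use, the end-to-end latency a user perceives matters as much as this ratio, so it is worth recording both.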

Figure 3: NVIDIA Riva speech AI skills capabilities.

Explore the Latest Breakthroughs in Speech AI

Speech AI Is Going Multilingual

Speech AI applications and pipelines must understand multiple languages, dialects, and accents to be deployed around the world. In the United States, as in most other countries, people speak many different languages. In use cases like call centers, a customer sometimes uses more than one language to describe what's going on. The next step is to have speech AI applications that can handle these situations.

Developers can use separate speech models for each language or a single model that can handle more than one language. Learn more on the Speech Recognition Collections page about ASR models in different languages.
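The two options can be combined in application code. The sketch below routes audio to a language-specific model when one is registered and otherwise falls back to a single multilingual model. The `LanguageRouter` class and the `Transcriber` type are hypothetical names for this example. Each model is represented as a plain callable, so nothing here reflects the actual NeMo or Riva APIs.

```python
from typing import Callable, Dict, Optional

# Hypothetical model interface: takes raw audio bytes, returns a transcript.
Transcriber = Callable[[bytes], str]


class LanguageRouter:
    """Pick a per-language model when available, else a multilingual fallback."""

    def __init__(self, multilingual: Optional[Transcriber] = None) -> None:
        self._per_language: Dict[str, Transcriber] = {}
        self._multilingual = multilingual

    def register(self, language_code: str, model: Transcriber) -> None:
        """Register a dedicated model for one language code, such as "en-US"."""
        self._per_language[language_code] = model

    def transcribe(self, audio: bytes, language_code: Optional[str] = None) -> str:
        """Transcribe with the dedicated model if one matches, else the fallback."""
        model = self._per_language.get(language_code) if language_code else None
        if model is None:
            model = self._multilingual
        if model is None:
            raise LookupError(f"no model available for language {language_code!r}")
        return model(audio)
```

A dedicated model per language keeps each model small and easy to tune, while the multilingual fallback covers callers whose language is unknown or who switch languages partway through a conversation.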

Taking Speech AI From Cloud to Device

When companies first started using speech AI, everyone used cloud services because they’re easy to set up and use. Slowly, companies started switching to on-premises solutions to avoid privacy issues with their data. Now, on-device solutions are the latest breakthrough, not just for keeping data private but also for faster inference and cutting costs. 

NVIDIA Riva allows applications to be deployed in embedded, data center, and cloud environments to develop customizable speech AI interfaces for your conversational AI application.

Get Started With Speech AI

Get Started With Speech AI Workflows

Shorten development time with packaged AI workflows, which include NVIDIA AI frameworks and pretrained models, as well as resources such as Helm charts, Jupyter Notebooks, and documentation to help you jump-start building AI solutions.

Start Developing With Containers and Models

While large-scale deployments require a purchase of NVIDIA Riva, NVIDIA also offers a variety of containers, models, and customization tools free of charge.

Access Educational Resources

Get an Introduction to Speech AI

Understand speech AI core concepts and how to build and deploy voice-technology applications.

Demystify Speech AI

Learn how speech AI technologies such as automatic speech recognition (ASR) and text-to-speech (TTS) automate millions of conversations today.

Browse Speech AI Blogs

Learn what speech AI is, how it has changed over time, what its key components, challenges, and use cases are, and what NVIDIA Speech AI SDKs offer.

Take a Closer Look at NVIDIA Riva

Understand the key features of NVIDIA Riva that can help you build speech AI services.

Sign up to receive the latest speech AI news from NVIDIA.