Speech AI

Provide voice-based interfaces for your conversational AI applications.

What Is Speech AI?

Speech AI gives people the ability to converse with devices, machines, and computers to simplify and augment their lives. A subset of conversational AI, it includes automatic speech recognition (ASR) and text-to-speech (TTS) to convert the human voice into text and generate a human-like voice from written words—making powerful technologies like virtual assistants, real-time transcriptions, voice searches, and question-answering systems possible.

The Benefits of Using Speech AI

World-Class Accuracy

Deliver exceptional customer experiences with best-in-class accuracy, achieved through speech AI model customization.

Multiple Language Support

Broaden your customer base by offering voice-based applications in the languages your customers speak.

High Performance and Scalability

Serve more customers with low-latency, high-throughput applications that can instantly scale on any infrastructure: on premises, cloud, edge, or embedded.

A Unique, Natural Voice for Your Brand

Give your customer service a boost by delivering fast and meaningful engagements with your brand's unique voice.

Free Ebook: Building Speech AI Applications

Learn how to build and deploy real-time speech AI pipelines for your conversational AI application.

How Speech AI Is Being Used

Transcribe Multiple Speakers at Once

Modern speech-to-text algorithms transcribe meetings, lectures, and social conversations while identifying speakers and labeling their contributions. With NVIDIA speech AI technologies and SDKs, you can create accurate transcriptions for call center conversations and video conferencing meetings or automate clinical note-taking during physician-patient interactions.
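
As a rough illustration of speaker-labeled transcription, the sketch below merges two inputs that such a pipeline typically produces: words with timestamps from a recognizer, and speaker turns from a diarization step. It is a minimal sketch in plain Python, not NVIDIA's implementation. The `Word` and `SpeakerTurn` types and the `label_words` function are hypothetical names introduced here, and the maximum-overlap rule is an assumed heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class SpeakerTurn:
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Word], turns: List[SpeakerTurn]) -> List[str]:
    """Assign each word to the turn it overlaps most, then group consecutive
    words from the same speaker into one transcript line.

    Assumes `turns` is non-empty; a word that overlaps no turn falls back to
    the first turn in the list.
    """
    lines = []  # list of (speaker, [tokens]) pairs, in time order
    for word in words:
        best = max(turns, key=lambda t: overlap(word.start, word.end, t.start, t.end))
        if lines and lines[-1][0] == best.speaker:
            lines[-1][1].append(word.text)
        else:
            lines.append((best.speaker, [word.text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


if __name__ == "__main__":
    # Hypothetical sample data standing in for recognizer and diarizer output.
    words = [
        Word("hello", 0.0, 0.4), Word("doctor", 0.4, 0.9),
        Word("how", 1.2, 1.4), Word("are", 1.4, 1.6), Word("you", 1.6, 1.9),
    ]
    turns = [SpeakerTurn("patient", 0.0, 1.0), SpeakerTurn("physician", 1.0, 2.0)]
    for line in label_words(words, turns):
        print(line)
```

A production system would also need to handle overlapping speech and words that straddle a turn boundary, which this sketch resolves only by picking the larger overlap.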

Make Your Assistants Virtual

Virtual assistants communicate with users through a speech interface and assist with various tasks, from resolving customer issues in call centers, to turning on the TV as a smart home assistant, to navigating to the nearest gas station as an in-car intelligent assistant. Use NVIDIA Omniverse™ Avatar Cloud Engine (ACE) to integrate NVIDIA speech AI technologies, delivered as easy-to-use, deep-neural-network-based components, into your interactive avatar applications for accurate, fast, and natural interactions.

Brand Your Voice

With a recognizable brand voice, companies can create applications that build relationships with customers while supporting all customers, including those with speech and language deficits. With NVIDIA Custom Voice, part of NVIDIA speech AI, you can easily create a unique, high-quality voice personality for your brand in hours versus weeks and with as little as 30 minutes of recorded speech data.

Develop Customizable Speech AI Interfaces

Shorten Training by Using Pretrained Models

Modern speech AI systems use deep neural network (DNN) models trained on massive datasets. Over time, the size of speech AI models has grown so much that training such models can take weeks of intensive compute time, even when using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet, on high-performance GPUs.

NVIDIA speech AI offers pretrained, production-quality models in the NVIDIA NGC™ catalog. These models are trained on several public and proprietary datasets for hundreds of thousands of hours on NVIDIA DGX™ systems.

Figure 1: Highly accurate pretrained models.

Figure 2: End-to-end TAO Toolkit workflow.

Customize Models for Higher Accuracy

Many enterprises have to customize speech AI models to achieve the desired accuracy for their specific conversational applications. However, customizing speech AI models from scratch usually requires large training datasets and AI expertise.

To speed up development and customize speech models extensively without prior AI experience, you can use the NVIDIA TAO Toolkit, a low-code AI model development toolkit. It applies a proven transfer learning approach to a pretrained model and fine-tunes speech AI models for your use case. NVIDIA also offers NeMo, an open-source toolkit for researchers to build state-of-the-art (SOTA) speech AI models. Models optimized with NeMo and the TAO Toolkit can easily be exported and deployed in NVIDIA® Riva on premises or in the cloud as a speech service.

Achieve Natural Interactions by Developing Real-Time Skills

For speech AI skills, companies have always had to choose between accuracy and real-time performance. Users can’t be expected to ask a question and then wait several seconds for a response. Nor do companies want their conversational AI applications to misinterpret speech or produce gibberish.

With NVIDIA Riva, companies can achieve world-class accuracy and run their speech AI pipelines in real time, within a few milliseconds. Riva offers SOTA pretrained models on NGC, low-code tools like the TAO Toolkit for fine-tuning to achieve world-class accuracy, and optimized skills for real-time performance.
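
One common way to put a number on "real time" is the real-time factor (RTF): processing time divided by the duration of the audio processed. The page does not say how its latency figures are measured, so the sketch below is only an assumed way to check a pipeline of your own. The `real_time_factor` and `timed` helpers are hypothetical names, and the function passed to `timed` stands in for whatever recognizer or synthesizer you call.

```python
import time
from typing import Any, Callable, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of processing time to audio duration.

    Values below 1.0 mean the pipeline keeps up with incoming audio.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real speech model call: it just upper-cases its input.
    def fake_recognizer(chunk: str) -> str:
        return chunk.upper()

    text, elapsed = timed(fake_recognizer, "one second of audio")
    # Assumes the chunk above represents 1.0 second of audio.
    print(text, real_time_factor(elapsed, 1.0))
```

RTF describes throughput on a single stream. Perceived responsiveness also depends on end-to-end latency, including network and buffering delays, which should be measured separately.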

Figure 3: NVIDIA Riva speech AI skills capabilities.

Explore the Latest Breakthroughs in Speech AI

Speech AI Is Going Multilingual

Speech AI applications and pipelines must understand multiple languages, dialects, and accents to be deployed around the world. For example, people in the United States, as in most other countries, speak a variety of languages. In use cases like call centers, there are times when a customer uses more than one language to describe what's going on. The next step is to have speech AI applications that can handle these situations.

Developers can use separate speech models for each language or a single model that can handle more than one language. Learn more on the Speech Recognition Collections page about ASR models in different languages.
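
The two options above can be combined: route each request to a dedicated model when one exists for its language, and fall back to a multilingual model otherwise. The sketch below shows that routing logic in plain Python. The `pick_model` function and every model name in it are hypothetical placeholders, not entries from any NVIDIA catalog.

```python
from typing import Dict, Optional


def pick_model(
    language_code: str,
    per_language_models: Dict[str, str],
    multilingual_model: Optional[str] = None,
) -> str:
    """Choose a model name for a language code such as "en-US".

    Lookup order: exact code, then base language ("en"), then the
    multilingual fallback if one was provided.
    """
    base = language_code.split("-")[0].lower()
    for key in (language_code, base):
        if key in per_language_models:
            return per_language_models[key]
    if multilingual_model is not None:
        return multilingual_model
    raise LookupError(f"no model available for language {language_code!r}")


if __name__ == "__main__":
    # Placeholder names for illustration only.
    models = {"en-US": "asr-en-us-placeholder", "es": "asr-es-placeholder"}
    for code in ("en-US", "es-MX", "hi-IN"):
        print(code, "->", pick_model(code, models, "asr-multilingual-placeholder"))
```

This handles one language per request. A customer who switches languages mid-call, as described above, would need either per-utterance language detection in front of this router or a model trained to handle mixed-language speech.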

Taking Speech AI From Cloud to Device

When companies first started using speech AI, most relied on cloud services because they’re easy to set up and use. Over time, companies began switching to on-premises solutions to avoid privacy issues with their data. Now, on-device solutions are the latest breakthrough, not just for keeping data private but also for faster inference and lower costs.

NVIDIA Riva allows applications to be deployed in embedded, data center, and cloud environments to develop customizable speech AI interfaces for your conversational AI application.

Get Started With Speech AI

Get Access to Speech AI Workflows

Accelerate development time with packaged AI workflows for audio transcription and intelligent virtual assistants. Available with an NVIDIA Riva Enterprise software subscription, these AI workflows include full enterprise support and packaged NVIDIA AI frameworks and pretrained models, as well as resources such as Helm charts, Jupyter Notebooks, and documentation to help you jump-start building AI solutions.

Start Developing With Containers and Models

While large-scale deployments require NVIDIA Riva Enterprise software, NVIDIA also offers a variety of containers, models, and customization tools free of charge.

Access Educational Resources

Get an Introduction to Speech AI

Understand speech AI core concepts and how to build and deploy voice-technology applications.

Demystify Speech AI

Learn how speech AI technologies such as automatic speech recognition (ASR) and text-to-speech (TTS) automate millions of conversations today.

Browse Speech AI Blogs

Learn what speech AI is and how it has changed over time, explore its key components, challenges, and use cases, and get to know the NVIDIA speech AI SDKs.

Take a Closer Look at NVIDIA Riva

Understand the key features of NVIDIA Riva that can help you build speech AI services.
