Conversational AI

Conversational AI is a complex form of artificial intelligence that uses a combination of technologies to enable human-like interactions between computers and people. The most sophisticated systems can recognize speech and text, understand intent, recognize language-specific idioms and aphorisms, and respond in appropriate natural language.

What Is Conversational AI?

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech. 

You use conversational AI when your virtual assistant wakes you up in the morning, when you ask for directions on your commute, or when you communicate with a chatbot while shopping online. You speak in your normal voice and the device understands, finds the best answer, and replies with speech that sounds natural.

Applications of conversational AI come in several forms. The simplest are FAQ bots, which are trained to respond to queries, usually expressed in writing, from a defined database of pre-formatted answers. A more complex form of conversational AI is the virtual personal assistant, such as Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana. These engines are tuned to respond to simple requests.

A more specialized version of the personal assistant is the virtual customer assistant, which understands context and is able to carry on a conversation from one interaction to the next. Another specialized form of conversational AI is the virtual employee assistant, which learns the context of an employee’s interactions with software applications and workflows and suggests improvements. Virtual employee assistants are widely used in robotic process automation, a popular new software category.
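As a rough illustration of the simplest form described above, an FAQ bot can be sketched as a lookup over a fixed set of pre-formatted answers. Everything in this sketch is hypothetical: the FAQ entries, the `answer` function, and the word-overlap scoring are assumptions made for illustration, and production bots typically rely on trained models rather than counting shared words.

```python
import re

# Hypothetical FAQ database: each stored question maps to a pre-formatted answer.
FAQ = {
    "What are your opening hours?": "We are open 9am to 5pm, Monday to Friday.",
    "How do I reset my password?": "Use the 'Forgot password' link on the sign-in page.",
    "Where is my order?": "You can track your order from the 'My orders' page.",
}

FALLBACK = "Sorry, I don't have an answer for that."


def tokenize(text):
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def answer(query, min_overlap=2):
    """Return the answer whose stored question shares the most words with the query."""
    query_words = tokenize(query)
    best_question, best_score = None, 0
    for question in FAQ:
        score = len(query_words & tokenize(question))
        if score > best_score:
            best_question, best_score = question, score
    # Too little overlap means no stored question is a plausible match.
    if best_score < min_overlap:
        return FALLBACK
    return FAQ[best_question]
```

Calling `answer("how can I reset my password")` selects the password entry, while an unrelated query falls through to the fallback reply. The `min_overlap` threshold is an arbitrary choice that keeps a single shared word such as "my" from triggering a match.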

Why Conversational AI?

Conversational AI is an essential building block of human interactions with intelligent machines and applications, from robots and cars to home assistants and mobile apps. Getting computers to understand human languages, with all their nuances, and respond appropriately has long been a “holy grail” of AI researchers. But building systems with true natural language processing (NLP) capabilities was impossible before the arrival of modern AI techniques powered by accelerated computing.

In the last few years, deep learning has improved the state of the art in conversational AI and offered superhuman accuracy on certain tasks. Deep learning has also reduced the need for deep knowledge of linguistics and rule-based techniques for building language services, which has led to widespread adoption across industries like retail, healthcare, and finance.

Demand for advanced conversational AI tools is on the rise. An estimated 50 percent of searches will be conducted with voice by 2020 and, by 2023, there will be 8 billion digital voice assistants in use.

How Does Conversational AI work?

Responding to a question involves several steps: converting a user’s speech to text, understanding the text’s meaning, searching for the best response to provide in context, and providing that response with a text-to-speech tool. Typically, the conversational AI pipeline consists of three stages:

  • Automatic Speech Recognition (ASR)
  • Natural Language Processing (NLP) or Natural Language Understanding (NLU)
  • Text-to-Speech (TTS) with voice synthesis

The whole pipeline must respond within roughly 300 milliseconds, and each of these stages requires running multiple AI models, so the time available for each individual network to execute is around 10 milliseconds or less.
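The three stages above can be sketched as a chain of functions with per-stage timing. This is a minimal sketch rather than a real speech system: `recognize_speech`, `understand`, and `synthesize_speech` are hypothetical placeholders standing in for ASR, NLU, and TTS models, and the 300 ms budget is the figure this article cites for a responsive interaction.

```python
import time

LATENCY_BUDGET_MS = 300.0  # end-to-end target cited in this article


def recognize_speech(audio):
    """Placeholder ASR stage: pretend the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8")


def understand(text):
    """Placeholder NLU stage: return a canned reply based on a keyword."""
    if "weather" in text.lower():
        return "It looks sunny today."
    return "Sorry, could you rephrase that?"


def synthesize_speech(text):
    """Placeholder TTS stage: pretend the encoded text is a waveform."""
    return text.encode("utf-8")


def run_pipeline(audio):
    """Run ASR -> NLU -> TTS and record how long each stage takes, in ms."""
    timings = {}
    data = audio
    for name, stage in [("asr", recognize_speech),
                        ("nlu", understand),
                        ("tts", synthesize_speech)]:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return data, timings


reply_audio, timings = run_pipeline(b"What is the weather like?")
total_ms = sum(timings.values())
within_budget = total_ms <= LATENCY_BUDGET_MS
```

In a real deployment each placeholder would wrap one or more neural networks, and the `timings` dictionary would show which stage consumes most of the budget.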

Automatic speech recognition (ASR) takes human voice as input and converts it into readable text. Deep learning has replaced traditional statistical methods, such as Hidden Markov Models and Gaussian Mixture Models, as it offers higher accuracy when identifying phonemes.

Natural language understanding (NLU) takes text as input, understands context and intent, and generates an intelligent response. Deep learning models are applied for NLU because of their ability to accurately generalize over a range of contexts and languages. Transformer deep learning models, such as BERT (Bidirectional Encoder Representations from Transformers), are an alternative to recurrent neural networks. They apply an attention technique, parsing a sentence by focusing on the words most relevant to each word, whether those come before or after it. BERT accelerated progress in NLU by offering accuracy comparable to human baselines on benchmarks for question answering (QA), entity recognition, intent recognition, sentiment analysis, and more.
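The attention technique can be illustrated with scaled dot-product attention, the formulation used in the Transformer architecture. The sketch below is a simplified assumption: it feeds made-up two-dimensional embeddings directly in as queries, keys, and values, whereas models such as BERT first apply learned projections and use many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, earlier and later positions alike.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy embeddings for a three-word sentence (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(embeddings, embeddings, embeddings)
```

Each row of `weights` shows how strongly one word attends to every word in the sentence, including words that come after it, which is what makes the approach bidirectional.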

The last stage of the conversational AI pipeline involves taking the text response generated by the NLU stage and changing it to natural-sounding speech. Vocal clarity is achieved using deep neural networks that produce human-like intonation and a clear articulation of words. This step is accomplished with two networks—a synthesis network that generates a spectrogram from text and a vocoder network that generates a waveform from the spectrogram.

GPUs: Key to Conversational AI

The technology behind conversational AI is complex. It involves a multi-step process that requires a massive amount of computing power, and all of its computations must finish in less than 300 milliseconds to deliver a responsive user experience.

A GPU is composed of hundreds of cores that can handle thousands of threads in parallel. This has made GPUs the platform of choice to train deep learning models and perform inference because they can deliver 10X higher performance than CPU-only platforms.

With NVIDIA GPUs and NVIDIA® CUDA-X AI libraries, massive, state-of-the-art language models can be rapidly trained and optimized to run inference in just a couple of milliseconds—or thousandths of a second. This is a major stride towards ending the trade-off between an AI model that’s fast versus one that’s large and complex.

In addition, transformer-based deep learning models like BERT don’t require sequential data to be processed in order, allowing for much more parallelization and reduced training time on GPUs than RNNs. 
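The difference in parallelism can be shown with a toy comparison. In this sketch, which uses scalar inputs and hypothetical weights chosen only for illustration, each recurrent state depends on the previous one, while each attention output reads only the inputs and can therefore be computed as an independent task. The thread pool demonstrates that independence; it is not a performance benchmark.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar RNN: each state needs the previous state, so the loop is sequential."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def attend(position, inputs):
    """Toy scalar self-attention for one position; it reads only the inputs."""
    scores = [inputs[position] * x for x in inputs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


inputs = [0.2, -1.0, 0.7, 1.5]

# Sequential: step t cannot start until step t-1 has finished.
sequential_states = rnn_states(inputs)

# Parallel: every position is an independent task.
with ThreadPoolExecutor() as pool:
    parallel_outputs = list(
        pool.map(lambda i: attend(i, inputs), range(len(inputs)))
    )

# Computing the positions one by one gives the same results.
in_order = [attend(i, inputs) for i in range(len(inputs))]
```

On a GPU the same independence lets all positions be processed at once as matrix operations, which is where the reduction in training time comes from.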

The difference between a CPU and GPU.

The most advanced conversational AI technologies are being accelerated with NVIDIA GPUs:

  • Automatic Speech Recognition (ASR): Kaldi is a C++ toolkit that supports traditional methods and popular deep learning models for ASR. GPU-accelerated Kaldi solutions can process audio 3500X faster than real time and 10X faster than CPU-only options.
  • Natural Language Understanding (NLU): The parallel-processing capabilities and Tensor Core architecture of NVIDIA GPUs allow for higher throughput and scalability when working with complex language models—enabling record-setting performance for both the training and inference of BERT. GPU-accelerated BERT-base can perform inference 17X faster with NVIDIA T4 than CPU-only solutions. The ability to use unsupervised learning methods, transfer learning with pre-trained models, and GPU acceleration has enabled widespread adoption of BERT in the industry. To work toward the goal of truly conversational AI, language models are getting larger over time. Future models will be many times bigger than those used today, so NVIDIA built and open-sourced the largest Transformer-based AI yet: GPT-2 8B, an 8.3 billion-parameter language processing model that’s 24x bigger than BERT-Large.
  • Text-to-Speech (TTS): Popular text-to-speech deep learning models, GPU-accelerated Tacotron 2 and WaveGlow, can perform inference 9X faster with an NVIDIA T4 GPU than CPU-only solutions.

Conversational AI Use Cases

GPU-optimized language understanding models can be integrated into AI applications for industries like healthcare, retail, and financial services, powering advanced digital voice assistants in smart speakers and customer service lines. These high-quality conversational AI tools let businesses across sectors provide a previously unattainable standard of personalized service when engaging with customers.


Healthcare

One of the difficulties facing healthcare is making it easily accessible. Calling your doctor’s office and waiting on hold is a common occurrence, and connecting with a claims representative can be equally difficult. Using natural language processing (NLP) to train chatbots is an emerging approach within healthcare to address the shortage of healthcare professionals and open the lines of communication with patients.

Another key healthcare application for NLP is biomedical text mining, or BioNLP. Given the large volume of biological literature and the increasing rate of biomedical publications, natural language processing is a critical tool for extracting information from published studies to advance knowledge in the biomedical field, aiding drug discovery and disease diagnosis.

Financial Services

Natural language processing (NLP) is a critically important part of building better chatbots and AI assistants for financial services firms. Among the numerous language models used in NLP-based applications, BERT has emerged as a leading model. NVIDIA has recently broken records for speed in training BERT, which promises to help the billions of conversational AI services expected to come online in the coming years operate with human-level comprehension. Banks can also use NLP to assess the creditworthiness of clients with little or no credit history.


Retail

Chatbot technology is also commonly used in retail applications to accurately analyze customer queries and generate responses or recommendations. This streamlines the customer journey and improves efficiency in store operations. NLP is also used for text mining of customer feedback and for sentiment analysis.

Benefits of Conversational AI

Conversational AI offers several benefits. One is cost: human agents are expensive. While costs vary widely, the fully loaded cost of a customer service call ranges from $2.70 to $5.60, according to F. Curtis Barry & Co., and other estimates have placed the average price at about one dollar per minute. Replacing human operators with bots has obvious cost-saving benefits. Research has also shown that many people are more comfortable conversing with a computer than with a sales or customer service agent, making conversational AI an enabler of customer self-service.

Conversational AI is also more appropriate than keyboard interactions in many scenarios, for example when a person is driving a car or otherwise occupied, or when keyboards aren’t an option at all, as in elevators.

The core technology can also be used to interpret or refine vague queries or to interpret queries by people who speak a different language.

Gartner believes 85% of customer relationships with enterprises can be handled without human interaction and McKinsey & Co. has estimated that one-third of activities in about 60% of occupations worldwide could make use of the technology.

NVIDIA GPU-Accelerated Conversational AI tools

Deploying a service with conversational AI can seem daunting, but NVIDIA has tools to make this process easier, including Neural Modules (NeMo for short) and a new technology called NVIDIA Jarvis. To save time, pretrained models, training scripts, and performance results are available on the NVIDIA GPU Cloud (NGC) software hub.

NVIDIA Jarvis

NVIDIA Jarvis is a GPU-accelerated application framework that allows companies to use video and speech data to build state-of-the-art conversational AI services customized for their own industry, products, and customers.

Jarvis offers an end-to-end deep learning pipeline for conversational AI. It includes state-of-the-art deep learning models, such as NVIDIA’s Megatron BERT for natural language understanding. Enterprises can further fine-tune these models on their data using NVIDIA NeMo, optimize for inference using NVIDIA TensorRT, and deploy in the cloud and at the edge using Helm charts available on NGC, NVIDIA’s catalog of GPU-optimized software.

Applications built with Jarvis can take advantage of innovations in the new NVIDIA A100 Tensor Core GPU for AI computing and the latest optimizations in NVIDIA TensorRT for inference. This makes it possible to run an entire multimodal application, using the most powerful vision and speech models, faster than the 300-millisecond threshold for real-time interactions.  

NVIDIA GPU-Accelerated, End-to-End Data Science

The RAPIDS suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like Pandas and Scikit-Learn APIs. 

Data preparation, model training, and visualization.

NVIDIA GPU-Accelerated Deep Learning Frameworks

GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks, and provide interfaces to commonly used programming languages such as Python and C/C++. Widely used deep learning frameworks such as MXNet, PyTorch, TensorFlow, and others rely on NVIDIA GPU-accelerated libraries to deliver high-performance, multi-GPU accelerated training.

NVIDIA GPU-accelerated libraries.

The Future of Conversational AI on the NVIDIA Platform

What drives the massive performance requirements of Transformer-based language networks like BERT and GPT-2 8B is their sheer complexity as well as pre-training on enormous datasets. The combination needs a robust computing platform to handle all the necessary computations to drive both fast execution and accuracy. The fact that these models can work on massive unlabeled datasets has made them a hub of innovation for modern NLP and, by extension, a strong choice for the coming wave of intelligent assistants with conversational AI applications across many use cases.

The NVIDIA platform with its Tensor Core architecture provides the programmability to accelerate the full diversity of modern AI, including Transformer-based models. In addition, the data center-scale design and optimizations of the DGX SuperPOD, combined with software libraries and direct support for leading AI frameworks, provide a seamless end-to-end platform for developers to take on the most daunting NLP tasks.

Continuous optimizations to accelerate training of BERT and Transformer for GPUs on multiple frameworks are freely available on NGC, NVIDIA’s hub for accelerated software.

NVIDIA TensorRT includes optimizations for running real-time inference on BERT and large Transformer-based models. To learn more, check out our “Real-Time BERT Inference for Conversational AI” blog. NVIDIA’s BERT GitHub repository also has code today to reproduce the single-node training performance quoted in this blog, and in the near future, the repository will be updated with the scripts necessary to reproduce the large-scale training performance numbers. For the NVIDIA research team’s NLP code on Project Megatron, head over to the Megatron Language Model GitHub repository.

Next Steps

To learn more, find out how:

  • Developers can select a Jarvis pre-trained model from NVIDIA’s NGC catalog, fine-tune it using their own data with the NVIDIA Transfer Learning Toolkit, optimize it for maximum throughput and minimum latency in real-time speech services, and then easily deploy the model with just a few lines of code so there is no need for deep AI expertise.
  • GPU-accelerated data centers can deliver unprecedented performance with fewer servers, less floor space, and reduced power consumption. The NVIDIA GPU Cloud provides extensive software libraries at no cost, as well as tools for building high-performance computing environments that take full advantage of GPUs.
  • Widely used deep learning frameworks such as MXNet, PyTorch, TensorFlow, and others rely on NVIDIA GPU-accelerated libraries to deliver high-performance multi-GPU accelerated training.
  • The NVIDIA Deep Learning Institute offers instructor-led, hands-on training on the fundamental tools and techniques for building Transformer-based natural language processing models for text classification tasks, such as categorizing documents.