Virtual Digital Assistant

A virtual digital assistant is a program that understands natural language and can answer questions or complete tasks based on voice commands.

What Is a Virtual Assistant?

Virtual digital assistants like Siri, Alexa, Google Home, and Cortana use conversational AI to recognize and respond to voice commands in order to carry out electronic tasks. Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech. You use conversational AI when your virtual assistant wakes you up in the morning: you speak in your normal voice, the device understands, finds the best answer, and replies with speech that sounds natural.

Virtual digital assistants are essentially voice-enabled front ends to cloud applications. The software is most often embedded in smartphones, tablets, desktop computers and, in some cases, dedicated devices. In most cases, the assistant is connected to the Internet to access the cloud-based back ends needed to recognize speech and perform queries. The technology behind conversational AI is complex, involving a multi-step process that requires a massive amount of computing power and computations that must happen in less than 300 milliseconds to deliver a great user experience.

Virtual personal assistants such as Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana are tuned to respond to simple requests without carrying context from one conversation to the next. A more specialized version of personal assistant is the virtual customer assistant, which understands context and is able to carry on a conversation from one interaction to the next. Another specialized form of conversational AI is virtual employee assistants, which learn the context of an employee’s interactions with software applications and workflows and suggest improvements. Virtual employee assistants are widely used in the popular new software category of robotic process automation.

Why Virtual Assistants and Conversational AI?

Demand for digital voice assistants is on the rise: Juniper Research estimates there will be 8 billion digital voice assistants in use by 2023, more than triple the 2.5 billion that were in use at the end of 2018. The shift toward working from home, telemedicine, and remote learning has created a surge in demand for custom, language-based AI services ranging from customer support to real-time transcription and summarization of video calls to keep people productive and connected.

Applications in conversational AI are growing every day, from voice assistants to question-answering systems that enable customer self-service. The range of industries adopting conversational AI into their solutions is wide, spanning diverse domains from finance to healthcare. The technology is especially useful in situations in which using a screen or keyboard is inconvenient or unsafe, such as while driving a car. Virtual assistants are already ubiquitous in smartphones. As applications become mainstream and get deployed through devices in the home, car, and office, research from academia and industry for this space has exploded.

How Does Conversational AI work?

Virtual assistants require massive amounts of data and incorporate several artificial intelligence capabilities. Algorithms enable the assistant to learn from requests and improve contextual responses, such as providing answers based upon previous queries.
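As a minimal illustration of "answers based upon previous queries," the sketch below keeps the last recognized topic so a follow-up question can be resolved in context. All names here are hypothetical placeholders, not any vendor's API:

```python
# Toy sketch of contextual follow-up handling; the class and its
# responses are hypothetical, not tied to any real assistant.
class Assistant:
    def __init__(self):
        self.last_topic = None  # context carried across turns

    def answer(self, query):
        query = query.lower()
        if "weather" in query:
            self.last_topic = "weather"
            return "Today will be sunny."
        if "tomorrow" in query and self.last_topic == "weather":
            # Resolve the follow-up using the stored context.
            return "Tomorrow will be rainy."
        return "Sorry, I don't understand."
```

A fresh assistant cannot answer "What about tomorrow?" on its own; after a weather query, the stored context makes the follow-up meaningful.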

A typical conversational AI application uses three subsystems to process and transcribe the audio, understand the question (derive its meaning) and generate the response text, and speak the response back to the human. These steps are achieved by multiple deep learning solutions working together. First, automatic speech recognition (ASR) processes the raw audio signal and transcribes text from it. Second, natural language processing (NLP) or natural language understanding (NLU) derives meaning from the transcribed text (the ASR output). Last, speech synthesis, or text-to-speech (TTS), artificially produces human speech from the response text. Optimizing this multi-step process is complicated, as each of these steps requires building and using one or more deep learning models.
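The three stages can be sketched as a chain of functions. The stubs below are hypothetical placeholders standing in for deep learning models, not real ASR/NLU/TTS APIs:

```python
# Toy sketch of the ASR -> NLU -> TTS chain; each stub stands in
# for one or more deep learning models in a real system.
def asr(audio_bytes):
    """Automatic speech recognition: raw audio -> text (stubbed)."""
    return "what is the weather today"

def nlu(text):
    """Natural language understanding: text -> intent."""
    return "get_weather" if "weather" in text else "unknown"

def generate_response(intent):
    """Pick a textual response for the recognized intent."""
    return {"get_weather": "It is sunny today."}.get(intent, "Sorry, I didn't catch that.")

def tts(text):
    """Text-to-speech: text -> synthesized audio (stubbed as a tagged string)."""
    return f"<synthesized audio: {text}>"

def respond(audio_bytes):
    # The full pipeline: audio in, synthesized speech out.
    return tts(generate_response(nlu(asr(audio_bytes))))
```

In a production system each function would be one or more GPU-accelerated models, and the whole chain must complete within the latency budget discussed below.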

Deep learning models are applied for NLU because of their ability to accurately generalize over a range of contexts and languages. Transformer deep learning models, such as BERT (Bidirectional Encoder Representations from Transformers), are an alternative to recurrent neural networks that applies an attention technique—parsing a sentence by focusing attention on the most relevant words that come before and after each word. BERT revolutionized progress in NLU by offering accuracy comparable to human baselines on benchmarks for question answering (QA), entity recognition, intent recognition, sentiment analysis, and more.
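The attention idea behind Transformer models can be illustrated in plain Python: each word's query vector is scored against every word's key vector, the scores are softmax-normalized into weights, and the output is the weighted sum of value vectors. This is a single-head, single-query sketch of scaled dot-product attention, not BERT itself:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context
```

Keys similar to the query receive higher weights, which is how the model "focuses attention" on the most relevant words.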

GPUs: Key to Conversational AI

Conversational AI requires a massive amount of computing power and must deliver results in less than 300 milliseconds.

A GPU is composed of hundreds of cores that can handle thousands of threads in parallel. GPUs have become the platform of choice to train deep learning models and perform inference because they can deliver 10X higher performance than CPU-only platforms.
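The 300-millisecond target translates into a per-stage latency budget that ASR, NLU, and TTS must share. A simple bookkeeping helper makes the constraint concrete; the per-stage figures below are illustrative, not measured benchmarks:

```python
BUDGET_MS = 300  # end-to-end target for a real-time interaction

def check_budget(stage_latencies_ms, budget_ms=BUDGET_MS):
    """Sum per-stage latencies and report whether the pipeline fits the budget."""
    total = sum(stage_latencies_ms.values())
    return total, total <= budget_ms

# Illustrative per-stage numbers only -- not benchmark results.
total, ok = check_budget({"asr": 100, "nlu": 80, "tts": 100})
```

Shaving latency from any one stage (for example, by running inference on a GPU) leaves more headroom for the others.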

The difference between a CPU and a GPU.

NVIDIA GPU-Accelerated Conversational AI tools

Deploying a service with conversational AI can seem daunting, but NVIDIA has tools to make this process easier, including a new technology called NVIDIA Jarvis.

NVIDIA Jarvis, a tool that helps deploy conversational AI.

NVIDIA Jarvis is a GPU-accelerated application framework that allows companies to use video and speech data to build state-of-the-art conversational AI services customized for their own industry, products, and customers.

This framework offers an end-to-end deep learning pipeline for conversational AI. It includes state-of-the-art deep learning models, such as NVIDIA’s Megatron BERT for natural language understanding. Enterprises can further fine-tune these models on their data using NVIDIA NeMo, optimize for inference using NVIDIA® TensorRT, and deploy in the cloud and at the edge using Helm charts available on NVIDIA GPU Cloud™ (NGC), NVIDIA’s catalog of GPU-optimized software.

Applications built with Jarvis can take advantage of innovations in the new NVIDIA A100 Tensor Core GPU for AI computing and the latest optimizations in NVIDIA TensorRT for inference. This makes it possible to run an entire multimodal application, using the most powerful vision and speech models, faster than the 300-millisecond threshold for real-time interactions.  

Jarvis Use Cases

Companies worldwide are using NVIDIA’s conversational AI platform to improve their services.

Voca’s AI virtual agents—which use NVIDIA for faster, more interactive, human-like engagements—are used by Toshiba, AT&T, and other world-leading companies. Voca uses AI to understand the full intent of a customer’s spoken conversation and speech. This makes it possible for the agents to automatically identify different tones and vocal clues to discern between what a customer says and what a customer means. Additionally, they can use scalability features built into NVIDIA’s AI platform to dramatically reduce customer wait time.

Kensho, the innovation hub for S&P Global located in Cambridge, Mass. that deploys scalable machine learning and analytics systems, has used NVIDIA’s conversational AI to develop Scribe, a speech-recognition solution for finance and business. With NVIDIA, Scribe outperforms other commercial solutions on earnings calls and similar financial audio in terms of accuracy by a margin of up to 20 percent.

Square has created an AI virtual assistant that allows Square sellers to use AI to automatically confirm, cancel, or change appointments with their customers. This frees them to conduct more strategic customer engagement. With GPUs, Square is able to train models 10X faster versus CPUs to deliver more accurate, human-like interactions.

Next Steps

To learn more, refer to:

  • GPU-accelerated data centers can deliver unprecedented performance with fewer servers, less floor space, and reduced power consumption. The NVIDIA GPU Cloud provides extensive software libraries at no cost, as well as tools for building high-performance computing environments that take full advantage of GPUs.
  • NVIDIA CUDA-X AI software acceleration libraries use GPUs in machine learning (ML) to accelerate workflows and realize model optimizations.
  • The RAPIDS suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like Pandas and Scikit-Learn APIs.
  • Widely used deep learning frameworks such as MXNet, PyTorch, TensorFlow, and others rely on NVIDIA GPU-accelerated libraries to deliver high-performance, multi-GPU accelerated training.
  • The NVIDIA Deep Learning Institute (DLI) offers instructor-led, hands-on training on the fundamental tools and techniques for building Transformer-based natural language processing models for text classification tasks like categorizing documents.