CUDA Spotlight: Ian Lane
GPU-Accelerated Speech Recognition
This week's Spotlight is on Dr. Ian Lane. Ian is an Assistant Research Professor at Carnegie Mellon University where he leads a speech and language processing research group based in Silicon Valley. He co-directs the CMU CUDA Center of Excellence with Dr. Kayvon Fatahalian.
Ian and his team are actively conducting research in hardware-accelerated perceptual computing, currently focusing on GPU-accelerated methods for speech recognition, image processing and natural language understanding.
This interview is part of the CUDA Spotlight Series.
[Editor's Note: Learn more about GPU-accelerated speech recognition at the GPU Technology Conference. Carnegie Mellon Ph.D. candidate Wonkyum Lee will speak on GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search, on Wed., March 26.]
Q & A with Ian Lane
NVIDIA: Ian, what is Speech Recognition?
Speech Recognition spans many research fields, including signal processing, computational linguistics, machine learning and core problems in computer science, such as efficient algorithms for large-scale graph traversal.
Speech Recognition also is one of the core technologies required to realize natural Human Computer Interaction (HCI). It is becoming a prevalent technology in interactive systems being developed today.
NVIDIA: What are examples of real-world applications?
NVIDIA: Why accelerate Speech Recognition? What kinds of new applications could be enabled?
In the past, we would often build one system optimized for accuracy (that could be up to 10x slower than real-time), and a separate system optimized for real-time Speech Recognition. The real-time Speech Recognition would generally have lower accuracy in order to perform recognition at a faster speed. With recent GPU technologies, however, we have demonstrated that this tradeoff no longer has to be made.
By leveraging both the CPU and GPU processors during the Speech Recognition process, we can perform recognition using large, and in some cases multiple, models, obtaining high accuracy even on embedded and mobile systems.
Additionally, pre-recorded speech or multimedia content can be transcribed much more quickly. Compared to a single-threaded CPU implementation, which is the standard architecture used in Speech Recognition today, we are able to perform recognition up to 33x faster. For some content this allows us to transcribe 30 minutes of audio in less than one minute. It is very exciting to see how GPUs are opening up new possibilities because of the speed they enable.
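As a quick check of the arithmetic in this answer, the sketch below converts a speedup factor into wall-clock transcription time. It assumes the single-threaded baseline runs at roughly real time (one second of compute per second of audio); that baseline figure and the function name are illustrative assumptions, not numbers from the interview.

```python
def transcription_seconds(audio_minutes, speedup, baseline_rtf=1.0):
    """Estimated processing time in seconds.

    baseline_rtf is the assumed real-time factor of the baseline decoder
    (seconds of compute per second of audio); 1.0 is an assumption.
    """
    return audio_minutes * 60.0 * baseline_rtf / speedup


# 30 minutes of audio at a 33x speedup over a real-time baseline.
print(round(transcription_seconds(30, 33), 1))  # 54.5
```

Under that assumption, 1800 seconds of audio divided by 33 comes to about 54.5 seconds, consistent with "less than one minute".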
NVIDIA: Tell us about the HYDRA project.
With colleagues Dr. Jike Chong and Jungsuk Kim, we developed a new architecture for Speech Recognition specifically optimized for CUDA-enabled GPUs. Our proposed architecture efficiently leverages both GPUs and CPUs to achieve very large vocabulary Speech Recognition (over one million words) at up to 30x faster than CPU-only approaches.
The HYDRA project involves exposing the fine-grained concurrency across all components of Speech Recognition and efficiently mapping them onto the parallelism available on the latest multi-core CPU and GPU architectures. Each sub-component of the process, which includes feature extraction, acoustic model computation, language model lookup and graph search, was mapped to the appropriate compute architecture to allow for efficient execution.
One of the breakthroughs in this project was the execution of the core Speech Recognition search and language model lookup steps in parallel on the GPU and CPU cores respectively. This allowed us to perform Speech Recognition with extremely large language models (over one billion parameters), with little degradation to the speed at which Speech Recognition was performed on the GPU. An overview of this architecture is shown below.
Illustration showing the Hybrid CPU-GPU architecture used in HYDRA. Speech Recognition is performed on the GPU with a fully composed search graph with limited language model history. Extended language model context is introduced on-the-fly during the search process by performing model lookup on the CPU. Using this approach we are able to perform large vocabulary speech recognition up to 30x faster than a CPU-based implementation.
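The idea of introducing extended language model context on the fly can be sketched in a few lines. The example below assumes the search graph already carries a limited-history (here unigram) score for each word, and that the correction is applied as a score difference once the larger model has been consulted; the interview does not spell out this exact formula, so treat it as one plausible reading. The toy probabilities and all function names are hypothetical.

```python
import math

# Hypothetical toy models holding log-probabilities (not from the interview).
UNIGRAM = {
    "recognize": math.log(0.02),
    "speech": math.log(0.01),
    "beach": math.log(0.01),
}
BIGRAM = {
    ("recognize", "speech"): math.log(0.30),
    ("recognize", "beach"): math.log(0.001),
}


def large_lm_lookup(history, word):
    """Stand-in for the CPU-side lookup; backs off to the unigram score."""
    return BIGRAM.get((history, word), UNIGRAM[word])


def rescore(partial_score, history, word):
    """Swap the limited-history score baked into the path for the extended one."""
    return partial_score - UNIGRAM[word] + large_lm_lookup(history, word)


# Two competing hypotheses with the same (made-up) acoustic score.
acoustic = -10.0
for word in ("speech", "beach"):
    in_graph = acoustic + UNIGRAM[word]
    print(word, rescore(in_graph, "recognize", word))
```

With only the unigram scores the two hypotheses tie; after the lookup, "speech" scores higher than "beach" because the larger model has seen the preceding word.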
NVIDIA: How does HYDRA leverage GPU computing?
The acoustic model computation can be performed with a deep neural network, which computes the likelihood of phonetic units based on an acoustic observation. This computation involves performing multiple large matrix-multiply operations. The GPU performs extremely well for this task due to its high level of parallelism and fast block memory access.
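To make the shape of this computation concrete, here is a minimal pure-Python sketch of a feed-forward acoustic model scoring one frame. The layer sizes, weights, sigmoid hidden units and log-softmax output are illustrative assumptions; the interview does not describe the network's exact configuration. The output here is a log-posterior over phonetic units; hybrid systems commonly divide by unit priors to obtain scaled likelihoods, a step omitted from this sketch.

```python
import math


def affine(x, weights, bias):
    """One matrix-vector product plus bias; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def log_softmax(v):
    """Numerically stable log-softmax."""
    m = max(v)
    lse = m + math.log(sum(math.exp(a - m) for a in v))
    return [a - lse for a in v]


def acoustic_scores(frame, layers):
    """layers is a list of (weights, bias); the last layer feeds log-softmax."""
    h = frame
    for weights, bias in layers[:-1]:
        h = sigmoid(affine(h, weights, bias))
    weights, bias = layers[-1]
    return log_softmax(affine(h, weights, bias))


# Hypothetical 2-input, 2-hidden, 3-output network.
layers = [
    ([[0.2, -0.1], [0.4, 0.3]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]], [0.0, 0.0, 0.0]),
]
log_post = acoustic_scores([0.5, -1.0], layers)
```

Each `affine` call is where the large matrix multiplies occur in a real model, with thousands of rows rather than two or three.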
The second computation involves performing a time-synchronous Viterbi search over a weighted-Finite-State-Transducer (wFST). In Speech Recognition, this wFST network combines models for both word pronunciations and word sequences into a single search graph. The search process generates the most likely word sequence for the acoustic observations up until this point. The graphs used in Speech Recognition often contain tens of millions of states and hundreds of millions of arcs.
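A toy version of the time-synchronous search can be sketched as follows. It assumes arc weights and acoustic scores are costs (negative log-probabilities) that add along a path, that every arc consumes exactly one frame, and that there are no epsilon arcs; real decoding graphs are far larger and more involved. The three-state graph, the labels and the `beam` parameter are hypothetical.

```python
def viterbi_decode(arcs, start, finals, frame_costs, beam=float("inf")):
    """Time-synchronous Viterbi search over a small weighted graph.

    arcs: state -> list of (input_label, output_word_or_None, weight, next_state)
    frame_costs: one dict per frame mapping input_label -> acoustic cost
    Returns (total_cost, words) for the cheapest path ending in a final state.
    """
    active = {start: (0.0, ())}
    for costs in frame_costs:  # advance every active hypothesis by one frame
        nxt = {}
        for state, (cost, words) in active.items():
            for label, word, weight, dest in arcs.get(state, ()):
                new_cost = cost + weight + costs[label]
                new_words = words + (word,) if word else words
                # Keep only the best hypothesis arriving at each state.
                if dest not in nxt or new_cost < nxt[dest][0]:
                    nxt[dest] = (new_cost, new_words)
        if not nxt:
            return None
        best = min(c for c, _ in nxt.values())
        active = {s: t for s, t in nxt.items() if t[0] <= best + beam}
    final_tokens = [active[s] for s in finals if s in active]
    return min(final_tokens, key=lambda t: t[0]) if final_tokens else None


# Hypothetical graph: "hi" = h ay, "he" = h iy.
arcs = {
    0: [("h", None, 0.0, 1)],
    1: [("ay", "hi", 0.7, 2), ("iy", "he", 0.4, 2)],
}
frames = [
    {"h": 0.1, "ay": 2.0, "iy": 2.0},
    {"h": 3.0, "ay": 0.2, "iy": 1.5},
]
result = viterbi_decode(arcs, start=0, finals={2}, frame_costs=frames)
```

Here the path for "hi" costs about 1.0 and the path for "he" about 2.0, so the search returns "hi". The inner loops over active states and their outgoing arcs are the part that grows to hundreds of thousands of arc evaluations per frame in a full-size graph.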
Compared to the acoustic model computation, the time-synchronous Viterbi search is very communication-intensive. At any point during the search, hundreds of thousands of competing arcs are evaluated, requiring extremely diverse memory access.
NVIDIA: What challenges did you face in parallelizing the algorithms?
On re-convergent paths, multiple lists must be merged into a single list of the most likely hypothesis candidates. This list must be sorted to retain the top n-best candidates to pass forward during search.
Performing a list merge followed by sort as blocking functions is very inefficient on the GPU, especially as this process is performed tens of thousands of times per second. To overcome this challenge we developed a novel method based on an "Atomic Merge-and-Sort" operation, which enabled n-best lists to be merged atomically on GPU platforms. Our method heavily uses the 64-bit atomic Compare-And-Swap (atomicCAS) operation, which is implemented in hardware on modern CUDA-enabled GPUs. An overview of our proposed algorithm for "Atomic Merge-and-Sort" is shown below:
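The following is not the authors' algorithm, only a sketch of how a sorted n-best list can be maintained with nothing but compare-and-swap. It assumes each hypothesis is packed into one 64-bit key (an order-preserving encoding of a 32-bit score in the high bits, a hypothesis id in the low bits), so that a single CAS moves score and payload together. Python has no hardware atomics, so the `AtomicSlots` class simulates the CAS with a lock; every name here is hypothetical.

```python
import struct
import threading

EMPTY = 0  # smaller than any packed key for a finite score


def pack(score, hyp_id):
    """Pack a float32 score and a 32-bit id into one sortable 64-bit key."""
    (bits,) = struct.unpack("<I", struct.pack("<f", score))
    # Map IEEE-754 bits to an unsigned integer that sorts like the float.
    bits = bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000
    return (bits << 32) | (hyp_id & 0xFFFFFFFF)


def unpack(key):
    bits = key >> 32
    bits = bits ^ 0x80000000 if bits & 0x80000000 else bits ^ 0xFFFFFFFF
    (score,) = struct.unpack("<f", struct.pack("<I", bits))
    return score, key & 0xFFFFFFFF


class AtomicSlots:
    """Simulated array of 64-bit cells with a compare-and-swap primitive."""

    def __init__(self, n):
        self._vals = [EMPTY] * n
        self._lock = threading.Lock()

    def load(self, i):
        return self._vals[i]

    def compare_and_swap(self, i, expected, new):
        """Store new if the cell equals expected; always return the old value."""
        with self._lock:
            old = self._vals[i]
            if old == expected:
                self._vals[i] = new
            return old


def atomic_insert(slots, n, key):
    """Insert key into a descending n-best list, carrying displaced keys down."""
    for i in range(n):
        while True:
            cur = slots.load(i)
            if key <= cur:
                break  # this slot already holds something at least as good
            if slots.compare_and_swap(i, cur, key) == cur:
                key = cur  # we displaced cur; try to place it further down
                break
        if key == EMPTY:
            return


def merge_nbest(lists, n):
    """Merge several hypothesis lists concurrently, one thread per list."""
    slots = AtomicSlots(n)

    def worker(hyps):
        for score, hyp_id in hyps:
            atomic_insert(slots, n, pack(score, hyp_id))

    threads = [threading.Thread(target=worker, args=(h,)) for h in lists]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    keys = [slots.load(i) for i in range(n)]
    return [unpack(k) for k in keys if k != EMPTY]
```

For example, `merge_nbest([[(-1.5, 1), (-3.0, 2)], [(-0.25, 3), (-2.0, 4)]], 3)` keeps the three best-scoring hypotheses in descending order. Because each slot's value only ever increases, no thread needs to block while another finishes a merge or sort; a retry happens only when a CAS loses a race.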
Second, we found that significant improvements in throughput could be obtained by stacking features together during forward propagation through the deep neural network acoustic model.
By stacking N frames together we can use a single sequence of matrix-matrix computations for this step rather than repeating the forward propagation N times using vector-matrix operations. We observed the time taken for computation with N=16 was almost equivalent to the N=1 case, allowing us to perform acoustic model computation 16x more efficiently using a very simple technique.
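The equivalence behind frame stacking can be shown with a single affine layer. The sketch below computes the same outputs two ways: frame by frame (N vector-matrix products) and stacked (one matrix-matrix product in which each weight row is fetched once and reused across all N frames). Pure Python cannot reproduce the GPU timing described above; the point is only that the results match while the weight traffic drops. The function names and toy values are hypothetical.

```python
def forward_per_frame(frames, weights, bias):
    """N separate vector-matrix products: weights are traversed once per frame."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weights, bias)]
        for frame in frames
    ]


def forward_stacked(frames, weights, bias):
    """One matrix-matrix product: each weight row is reused for all N frames."""
    out = [[0.0] * len(weights) for _ in frames]
    for j, (row, b) in enumerate(zip(weights, bias)):  # fetch the row once
        for i, frame in enumerate(frames):             # apply it to every frame
            out[i][j] = sum(w * x for w, x in zip(row, frame)) + b
    return out


# Hypothetical layer with 2 inputs and 3 outputs, applied to N = 4 frames.
weights = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
bias = [0.1, -0.2, 0.0]
frames = [[1.0, 2.0], [0.0, -1.0], [3.5, 0.5], [-2.0, 4.0]]
assert forward_per_frame(frames, weights, bias) == forward_stacked(frames, weights, bias)
```

On a GPU, the stacked form corresponds to a single matrix-matrix multiply (GEMM in BLAS terms) rather than N matrix-vector multiplies. The trade-off is latency: the decoder has to wait for N frames to arrive before scoring them.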
NVIDIA: Which CUDA features and GPU libraries do you use?
NVIDIA: What other applications are you using GPUs for in your research?
We are able to train these models much more efficiently using a handful of GPUs rather than our traditional CPU clusters. For example, an acoustic model that would typically take more than 1000 CPU-hours to train on a CPU cluster can be trained in under ten hours on a single GPU.
NVIDIA: What are you most excited about, in terms of near-term advancements?
While these solutions are not possible today, GPU performance on mobile and embedded platforms is improving dramatically. In a few years the compute power on these platforms will match what we have on desktop PCs today. This will allow us to do much more processing on a local, embedded platform and enable us to explore new methods for robust Speech Recognition.
Bio for Ian Lane
Ian Lane is an Assistant Professor at Carnegie Mellon University. He leads the speech and language-processing group at CMU Silicon Valley and performs research in the areas of Speech Recognition, spoken language understanding and speech interaction.
Ian and his group are developing methods to accelerate speech and language technologies using General Purpose Graphics Processing Units (GPGPUs). His group has already obtained 1000x speedup for signal processing tasks, 100x speedup for Viterbi training and over 20x speedup for complex tasks such as graph search. These new technologies have enabled the group to explore novel interaction paradigms for human-machine interaction. Ian earned his M.S. and Ph.D. from Kyoto University, Japan and B.S. from Massey University, New Zealand.