BERT is a model for natural language processing developed by Google that learns bi-directional representations of text to significantly improve contextual understanding of unlabeled text across many different tasks.

It’s the basis for an entire family of BERT-like models such as RoBERTa, ALBERT, and DistilBERT.


What Makes BERT Different?

Bidirectional Encoder Representations from Transformers (BERT) was developed by Google as a way to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It was released under an open-source license in 2018. Google has described BERT as the “first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus” (Devlin et al. 2018).

Bidirectional models aren’t new to natural language processing (NLP). They involve looking at text sequences both from left to right and from right to left. BERT’s innovation was to learn bidirectional representations with transformers, which are a deep learning component that attends over an entire sequence in parallel in contrast to the sequential dependencies of RNNs. This enables much larger data sets to be analyzed and models to be trained more quickly. Transformers process words in relation to all the other words in a sentence at once rather than individually, using attention mechanisms to gather information about the relevant context of a word and encoding that context in a rich vector that represents it. The model learns how a given word’s meaning is derived from every other word in the segment.

Previous word embeddings, like that of GloVe and Word2vec, work without context to generate a representation for each word in the sequence. For example, the word “bat” would be represented the same way whether referring to a piece of sporting gear or a night-flying animal. ELMo introduced deep contextualized representations of each word based on the other words in the sentence using a bi-directional long short term memory (LSTM). Unlike BERT, however, ELMo considered the left-to-right and right-to-left paths independently instead of as a single unified view of the entire context.

Because the vast majority of BERTs parameters are dedicated to creating a high-quality contextualized word embedding, the framework is considered to be very suitable for transfer learning. By training BERT on self-supervised tasks (ones in which human annotations are not required) like language modeling, massive unlabeled datasets such as WikiText and BookCorpus can be used, which comprise more than 3.3 billion words. To learn some other task, like question-answering, the final layer can be replaced with something suitable for the task and fine-tuned.

The arrows in the image below indicate the information flow from one layer to the next in three different NLP models.

Diagram showing information flow from one layer to the next in three different NLP models.
Image source: Google AI Blog

BERT models are able to understand the nuances of expressions at a much finer level. For example, when processing the sequence “Bob needs some medicine from the pharmacy. His stomach is upset, so can you grab him some antacids?” BERT is better able to understand that  “Bob,” “his”, and “him” are all the same person. Previously, the query “how to fill bob’s prescriptions” might fail to understand that the person being referenced in the second sentence is Bob. With the BERT model applied, it’s able to understand how all these connections relate.

Bi-directional training is tricky to implement because conditioning each word on both the previous and next words by default includes the word that’s being predicted in the multilayer model. BERT’s developers solved this problem by masking predicted words as well as other random words in the corpus. BERT also uses a simple training technique of trying to predict whether, given two sentences A and B, B is the antecedent of A or a random sentence.


Natural language processing is at the center of much of the commercial artificial intelligence research being done today. In addition to search engines, NLP has applications in digital assistants, automated telephone response, and vehicle navigation, to name just a few. BERT has been called a game-changer because it provides a single model trained upon a large data set that has been shown to achieve breakthrough results on a wide range of NLP tasks.

BERT’s developers said models can be adapted to a “wide range of use cases, including question answering and language inference, without substantial task-specific architecture modifications. BERT doesn’t need to be pre-trained with labeled data, so it can learn using any plain text.

KEY BENEFITS (use cases)

BERT can be fine-tuned for many NLP tasks. It’s ideal for language understanding tasks like translation, Q&A, sentiment analysis, and sentence classification.

Targeted search

While today’s search engines do a pretty good job of understanding what people are looking for if they format queries properly, there are still plenty of ways to improve the search experience. For people with poor grammar skills or who don’t speak the language of the search engine provider, the experience can be frustrating. Search engines also frequently require users to experiment with variations of the same query to find the one that delivers the best results.

An improved search experience that saves even 10% of the 3.5 billion searches people conduct on Google alone every day adds up to significant savings in time, bandwidth, and server resources. From a business standpoint, it also enables search providers to better understand user behavior and serve up more targeted advertising.

Better understanding of natural language also improves the effectiveness of data analytics and business intelligence tools by enabling non-technical users to retrieve information more precisely and cutting down on errors related to malformed queries.

Accessible navigation

More than one in eight people in the United States has a disability, and many are limited in their ability to navigate physical and cyberspace. For people who must use speech to control wheelchairs, interact with websites, and operate devices around them, natural language processing is a life necessity. By improving response to spoken commands, technologies like BERT can improve quality of life and even enhance personal safety in situations in which rapid response to circumstances is required.

Why BERT Matters to...

Machine Learning Researchers

Invoking change in natural language processing equated to that of AlexNet on computer vision, BERT is markedly revolutionary to the field. The ability to replace only the final layer of the network to customize it for some new task means that one can easily apply it to any research area of interest. Whether the goal is translation, sentiment analysis, or some new task yet to be proposed, one can rapidly configure that network to try it out. With over 8,000 citations to date, this model's derivatives consistently show up to claim state-of-the-art on language tasks.

Software Developer

With BERT, the computational limitations to put state-of-the-art models into production are greatly diminished due to the wide availability of pretrained models on large datasets. The inclusion of BERT and its derivatives in well-known libraries like Hugging Face also means that a machine learning expert isn't necessary to get the basic model up and running.

BERT has established a new mark in natural language interpretation, demonstrating that it can understand more intricacies of human speech and answer questions more precisely than other models.

Why BERT is Better on GPUs

Conversational AI is an essential building block of human interactions with intelligent machines and applications–from robots and cars to home assistants and mobile apps. Getting computers to understand human languages, with all their nuances, and respond appropriately has long been a “holy grail” of AI researchers. But building systems with true natural language processing (NLP) capabilities was impossible before the arrival of modern AI techniques powered by accelerated computing.

BERT runs on supercomputers powered by NVIDIA GPUs to train its huge neural networks and achieve unprecedented NLP accuracy, impinging in the space of known as human language understanding. While there have been many natural language processing approaches, human-like language ability has remained an elusive goal for AI. With the arrival of massive Transformer-based language models like BERT and GPUs as an infrastructure platform for these state-of-the-art models, we are now seeing rapid progress on difficult language understanding tasks. AI like this has been anticipated for many decades. With BERT, it has finally arrived.

Model complexity drives the accuracy of NLP, and larger language models dramatically advance the state-of-the-art in natural language processing (NLP) applications such as question-answering, dialog systems, summarization, and article completion. BERT-Base was created with 110 million parameters, while the expanded BERT-Large model involves 340 million parameters. Training is highly parallelized, which makes it a good use case for distributed processing on GPUs. BERT models have even been shown to scale well to huge sizes like the 3.9 billion parameter Megatron-BERT.

The complexity of BERT, as well as training on enormous datasets, requires massive performance. This combination needs a robust computing platform to handle all the necessary computations to drive both fast execution and accuracy. The fact that these models can work on massive unlabeled datasets have made them a hub of innovation for modern NLP and by extension a strong choice for the coming wave of intelligent assistants with conversational AI applications across many use cases.

The NVIDIA platform provides the programmability to accelerate the full diversity of modern AI including Transformer-based models. In addition, data center scale design, combined with software libraries and direct support for leading AI frameworks, provides a seamless end-to-end platform for developers to take on the most daunting NLP tasks.

In a test using NVIDIA’s DGX SuperPOD system based on a massive cluster of DGX A100 GPU servers connected with HDR InfiniBand NVIDIA achieved a record BERT training time of .81 minutes using the MLPerf Training v0.7 benchmark. By comparison, Google’s TPUv3 logged a time of more than 56 minutes on the same test.