Multimodal Large Language Models

Multimodal large language models (MLLMs) are deep learning models that can understand and generate content across multiple modalities, including text, images, video, and audio.

What Are Multimodal Large Language Models?

MLLMs expand the capabilities of traditional large language models (LLMs), which are primarily focused on processing and generating text. By integrating multiple types of data, MLLMs enable more complex and versatile applications that require the synthesis and interpretation of both textual and nontextual information.

This means that MLLMs can interpret a variety of data domains, including:

  • Sensory data: Information from motion sensors, GPS, or other tracking devices.
  • 3D models: Spatial data and 3D representations used in design, gaming, or simulations.
  • Structured data: Data formats like spreadsheets or databases that require numerical or categorical interpretation.
  • Mixed documentation: Webpages or documents that combine text, code, images, and multimedia elements.

Why Are MLLMs Important?

The world is multimodal, and human interaction with digital content isn't limited to text.

MLLMs reflect this diversity with their ability to ingest, understand, and generate many data types, making AI interactions more natural and effective.

MLLMs are important because they help AI tools bridge the gap between humans and technology. The capability to understand and interpret different modalities of data gives rise to more impactful and fascinating applications in everyday life. For example, health care providers can leverage MLLMs to help evaluate a patient’s X-rays and medical files, then suggest personalized treatments or identify possible diagnoses.

The value of these models reaches far beyond health care. MLLMs can ingest nuanced documentation, such as PDFs that combine diagrams, charts, images, and text. This opens up use cases from education to the enterprise, where employees can leverage a chatbot to improve their workflows and productivity.

How Do MLLMs Work?

  1. Data ingestion and processing: When an MLLM receives input, such as a chart (a combination of caption text and an image), it must encode the data. A common approach is unimodal encoding, in which a separate encoder produces embeddings for each modality: the text and image data are converted into vectors so they can be processed by the model.
  2. Embedding, alignment, and fusion: These embeddings are then aligned and fused into a unified multimodal representation. This step is crucial for preparing the data in a way that supports cross-modal understanding (see the code sketch after this list for a minimal illustration of the overall flow).
  3. Learning cross-modal relationships: Through training on diverse datasets that include examples of how different modalities relate to each other, MLLMs learn to understand and generate content that reflects the complex interconnections between text, images, audio, and other data types.
  4. Output generation: Depending on the task, the model may use a decoder (e.g., an image decoder for generating images or a language decoder for text generation) to produce the desired output. As an example, the output can be purely textual, purely visual, or a combination of modalities.
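
As a concrete illustration of steps 1 through 4, here is a minimal, hypothetical PyTorch sketch of the encode, project, fuse, and decode flow. The module choices and dimensions are placeholders for illustration only, not the architecture of any particular MLLM.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Illustrative encode -> project -> fuse -> decode pipeline (not a real MLLM)."""

    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # 1. Unimodal encoders produce embeddings for each modality.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_encoder = nn.Sequential(           # stand-in for a ViT/CNN
            nn.Conv2d(3, 64, kernel_size=16, stride=16),
            nn.Flatten(2),                            # (B, 64, num_patches)
        )
        # 2. A projection layer aligns image features with the text embedding space.
        self.image_proj = nn.Linear(64, d_model)
        # 3. A transformer fuses the aligned token sequences.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # 4. A language decoder head produces text output (next-token logits).
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, image):
        text_tokens = self.text_embed(token_ids)                      # (B, T, d)
        img_feats = self.image_encoder(image).transpose(1, 2)         # (B, P, 64)
        img_tokens = self.image_proj(img_feats)                       # (B, P, d)
        fused = self.fusion(torch.cat([img_tokens, text_tokens], 1))  # (B, P+T, d)
        return self.lm_head(fused[:, -token_ids.size(1):])            # text logits

model = ToyMultimodalModel()
logits = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

Real MLLMs use much larger pretrained encoders and decoders, but the data path is conceptually similar.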

Similar to LLMs, MLLMs apply self-attention mechanisms, which compute attention scores reflecting how relevant different parts of the input data are to one another. In an MLLM, self-attention lets the model relate tokens in the text (one modality) to regions of an image (another modality). Because the self-attention mechanism doesn't capture sequence order on its own, positional encodings are needed to preserve the meaning of ordered data (such as temporal sequences in video). Without them, the model would interpret the data as unordered, potentially losing important structure.
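
The following is a small, self-contained sketch of the two mechanisms described above: scaled dot-product attention and sinusoidal positional encodings. It is a generic illustration using PyTorch tensors, not the attention implementation of any specific model.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention scores measure how relevant each position is to every other one."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

def sinusoidal_positions(seq_len, d_model):
    """Classic sinusoidal positional encodings that inject sequence order."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Toy sequence that mixes "image" and "text" tokens in one shared space.
tokens = torch.randn(1, 10, 64) + sinusoidal_positions(10, 64)  # order-aware input
out = scaled_dot_product_attention(tokens, tokens, tokens)      # self-attention
print(out.shape)  # torch.Size([1, 10, 64])
```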

The MLLM training process can require extensive computational resources because the underlying neural networks have billions of parameters. To manage this complexity, data and model parallelism distribute the computational workload across multiple GPUs, making training more efficient.
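
As a rough illustration of data parallelism, the sketch below uses PyTorch's DistributedDataParallel to replicate a stand-in model across GPUs, with each process training on its own shard of the data; it assumes a torchrun-style launch. Model parallelism, which splits the network itself across devices, requires additional framework support and is not shown here.

```python
# Minimal data-parallel sketch (launch with: torchrun --nproc_per_node=<gpus> train.py)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)   # stand-in for an MLLM
    model = DDP(model, device_ids=[local_rank])          # replicate model, shard the data
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                                  # each rank sees a different batch
        batch = torch.randn(8, 512, device=f"cuda:{local_rank}")
        loss = model(batch).pow(2).mean()                # placeholder loss for illustration
        loss.backward()                                  # gradients are all-reduced across GPUs
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```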

Stitching Modalities From Different Domains

Because an MLLM can process multiple modalities, there needs to be a way for all these modalities to be combined. Encoders help make this integration possible.

For each modality, specific encoders are used to transform that type of input data (e.g., text, images, audio) into embeddings in a shared high-dimensional vector space. The embeddings from each modality are then projected into a joint embedding space, where representations from different modalities can be compared and transformed from one to another.

A concrete example of an encoder and decoder pairing in an MLLM is an audio encoder with an image decoder. At its simplest, the encoder captures contextual information and translates input data into embeddings, while the decoder takes those embedded representations and generates output in the target modality. An audio encoder takes a voice recording and translates it into a series of feature vectors embedded in the shared vector space; an image decoder then takes image embeddings from the joint embedding space and generates the desired visual output.
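
The sketch below illustrates the idea of a joint embedding space: two hypothetical modality-specific encoders project their inputs into the same vector space, where cross-modal similarity can be computed directly. The encoder architectures, input dimensions, and shared dimension are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Placeholder: maps pooled audio features into the shared embedding space."""
    def __init__(self, in_dim=128, shared_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, shared_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-length embeddings

class ImageEncoder(nn.Module):
    """Placeholder: maps pooled image features into the same shared space."""
    def __init__(self, in_dim=768, shared_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, shared_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

audio_emb = AudioEncoder()(torch.randn(4, 128))   # 4 audio clips
image_emb = ImageEncoder()(torch.randn(4, 768))   # 4 images
similarity = audio_emb @ image_emb.T              # cosine similarity across modalities
print(similarity.shape)                           # torch.Size([4, 4])
```

Because both encoders land in the same space, a decoder for any modality can consume embeddings produced from any other modality.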

What Is the Difference Between LLMs and MLLMs?

LLMs trained on large quantities of textual data powered the initial wave of generative AI tools. Their ability to generalize from patterns seen during training to novel queries allowed them to effectively generate articles, emails, and code snippets.

While LLMs serve as the brain—providing language and context understanding—MLLMs go beyond the status quo of typical LLMs as they power the generation of a broad range of data modalities.

Exploring the Differences

The shift from traditional LLMs to their more advanced multimodal counterparts involves changes not only in capabilities but also in underlying architecture, applications, and training and fine-tuning approaches.

  • Data Processed: LLMs work with text-only data encoded by text tokenizers, while MLLMs handle multimodal data that requires separate encoders for each modality.
  • Model Architecture: LLMs use a single-transformer architecture; MLLMs use separate encoders for each modality, followed by a fusion module that projects the encoded representations into a unified embedding space.
  • Training Objective: LLMs use language modeling objectives like next-token prediction; MLLMs often add contrastive learning objectives that aim to align the representations of different modalities (see the loss sketch after this list).
  • Inference Computation Complexity: LLMs scale quadratically with the input sequence length; MLLMs carry the complexity of text-only LLMs plus the added computation of encoding inputs and decoding outputs from multiple modalities.
  • Modality Encoders: LLMs process only text, so they need no modality encoders; MLLMs use them to convert images, audio, and other nontext data into embeddings that reflect the content’s meaning.
  • Input Projector: LLMs typically process textual embeddings directly, without alignment from other modalities; in MLLMs, the input projector aligns the encoded representations from various modalities with text so the language model can process a unified input.
  • LLM Backbone: In LLMs, the backbone processes textual data; in MLLMs, it processes aligned multimodal inputs, using pretrained knowledge to perform tasks such as reasoning, comprehension, and content generation.
  • Output Projector: LLMs have no need to convert embeddings back to other modalities; in MLLMs, the output projector maps the model’s output embeddings back to the target modality for generating nontextual outputs.
  • Modality Generator: LLMs don’t have modality generators because they’re not designed to produce nontextual outputs; MLLMs use them to produce outputs in individual modalities, typically with latent diffusion models (LDMs).
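
The contrastive objective mentioned above can be sketched with a symmetric InfoNCE-style loss, as popularized by CLIP-like training: matched text/image pairs in a batch are pulled together while mismatched pairs are pushed apart. The embeddings here are random placeholders standing in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched text/image pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))        # the diagonal holds the true pairs
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```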

Model Architecture

LLMs are typically built on a single-transformer architecture, optimized for processing sequential data and managing long-range text dependencies. This architecture enables LLMs to proficiently understand and generate language. In contrast, MLLMs employ a more complex design that includes separate encoders for each modality, such as transformers for text and convolutional neural networks (CNNs) for images. These separate encoders capture and encode information specific to each modality. Following this, a fusion module integrates these encoded representations into a unified embedding space. This architecture allows MLLMs to seamlessly incorporate features from diverse data types, facilitating a holistic understanding of multimodal input data.
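
A fusion module of the kind described here is often implemented with cross-attention, where text tokens attend to encoded image features. The sketch below is a minimal, hypothetical version in PyTorch; real fusion modules vary widely across architectures, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses image features into the text stream: text tokens act as queries,
    image features provide keys and values."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, image_tokens):
        fused, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + fused)   # residual connection keeps the text signal

fusion = CrossAttentionFusion()
text = torch.randn(2, 32, 512)    # e.g., embedded caption tokens
image = torch.randn(2, 196, 512)  # e.g., projected ViT/CNN patch features
print(fusion(text, image).shape)  # torch.Size([2, 32, 512])
```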

Training Process

LLMs and MLLMs have distinct training processes. As a traditional text-based model, LLMs are trained using extensive datasets consisting of books, articles, and webpages. The goal is to teach the model to predict the next word in a sequence, enabling it to generate coherent and contextually relevant text. This process begins with data collection and preprocessing to clean and prepare the data for training. Transformers are the preferred architecture for LLMs due to their effectiveness in handling sequential data and long-range dependencies.

On the other hand, MLLMs, such as GPT-4V, are designed to learn from multiple data types, like images and text. This training is more complex, as it involves linking different modalities, such as associating the image of a dog with the word “dog” or generating descriptive text for an economic chart. The training process for MLLMs integrates techniques like CNNs for image processing with transformers for text, ensuring the model can handle and integrate features from both modalities effectively.
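
The next-token prediction objective described for text-only LLMs can be written in a few lines: the targets are simply the input tokens shifted by one position, and the loss is cross-entropy over the vocabulary. The logits below are random stand-ins for a model's output.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Next-token prediction: each position is trained to predict the token that follows it.
    logits: (B, T, vocab_size), token_ids: (B, T)"""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = token_ids[:, 1:]    # targets are the tokens at positions 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

vocab = 32000
loss = next_token_loss(torch.randn(2, 16, vocab), torch.randint(0, vocab, (2, 16)))
print(loss.item())
```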

Computational Requirements

Because of their architectural differences, the computational demands of training LLMs versus MLLMs also vary. LLMs require substantial GPU resources to manage the large-scale parallel processing needed for handling billions of parameters and large datasets. The self-attention mechanism in transformers, with its quadratic complexity, further increases these requirements. MLLMs, however, have even higher computational demands. Besides the challenges associated with transformers, they need additional processing power for CNNs used in image handling. The integration of different modalities often involves cross-attention techniques, adding to the computational load. Therefore, training MLLMs typically calls for more sophisticated hardware configurations and innovations in model architecture to optimize efficiency.
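
A back-of-envelope calculation makes the quadratic attention cost concrete. The function below counts attention-score entries as a function of sequence length; the layer, head, and token counts are illustrative assumptions, not measurements of any specific model.

```python
def attention_score_elements(seq_len, n_heads=32, n_layers=32):
    """Number of attention-score entries (one seq_len x seq_len matrix per head per layer).
    This is the quantity that grows quadratically with sequence length."""
    return n_layers * n_heads * seq_len * seq_len

text_only = attention_score_elements(seq_len=1024)             # text tokens only
multimodal = attention_score_elements(seq_len=1024 + 576)      # text plus e.g. 576 image-patch tokens
print(f"score entries grow by {multimodal / text_only:.1f}x")  # ~2.4x for ~1.6x more tokens
```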

What Are the Challenges of MLLMs?

Exploring the complexities of MLLMs reveals a range of challenges, from architectural intricacies to the nuances of data management and computational demands. 

Some of the key challenges include:

  • Model architecture and training:
    • Creating model architectures that can efficiently process and generate multimodal content while maintaining scalability is a complex task. Balancing the trade-offs between model capacity, performance, and resource requirements poses a significant challenge.
    • Training MLLMs on diverse multimodal data requires substantial computational resources, data storage, and time to iterate, making the training process costly.
  • Data representation: One of the core challenges in multimodal machine learning is the representation of data from multiple modalities in a unified form. Each modality has its own unique characteristics, data formats, and underlying structures. Integrating these disparate modalities into a unified representation that captures the richness and complexity of each while enabling effective interaction across modalities is inherently complex.
  • Data collection and curation: Gathering and curating large, high-quality, and diverse multimodal datasets is resource-intensive and presents privacy and ethical challenges. The absence of comprehensive, well-labeled multimodal datasets hinders the training of robust MLLMs.
  • Fusion: Modality fusion is the process of integrating information from various modalities to form a coherent representation the system can use for response generation. The challenge is that this integration must be performed in a manner that preserves the integrity of each modality’s information while also making sure the combined data is easily understandable. Cross-attention mechanisms are one common approach to achieve meaningful fusion by capturing cross-modal interactions, allowing the system to generate outputs that properly depict the interrelations of the different data types.
  • Factual accuracy and bias: MLLMs can hallucinate, which is a significant issue for applications that require high accuracy and reliability. They can also inherit and amplify biases present in their training data and/or base LLM.
  • Generalization: While MLLMs can perform impressively on tasks and datasets they’ve been trained or fine-tuned on, generalizing to real-world scenarios or unseen modalities can still be challenging.
  • Privacy: Handling diverse data types such as images, audio, and video raises privacy concerns, as MLLMs might inadvertently reveal sensitive information contained in the training data.

How Are MLLMs Being Applied?

With the opportunity to process various modalities through MLLMs, there are incredible applications across various industries.

  • Health care: MLLMs are enhancing diagnostic processes and patient care by integrating and analyzing diverse data forms. For instance, they can analyze visual data from medical imaging (like X-rays or MRIs) alongside textual clinical notes and time-series data such as blood pressure or heart rate trends. These capabilities assist physicians in diagnosing conditions more accurately and swiftly. By identifying patterns that may not be immediately obvious to human observers, MLLMs contribute to more precise and timely diagnoses, improving overall patient outcomes.
  • Enterprise productivity: Within enterprises, MLLMs are being deployed to improve productivity through advanced support systems. Employees interact with AI-driven assistants that understand queries in both text and other data forms, such as charts or graphs. These systems help to generate reports, summarize meetings, and even provide real-time assistance during presentations by interpreting and responding to multimodal data.
  • Customer service: MLLMs are revolutionizing customer service by enabling more sophisticated user interactions. For example, customer service AIs can analyze videos or images sent by customers to quickly diagnose product issues. Additionally, they can understand and respond to vocal tone and emotional cues in customer voice communications, providing responses that are not only contextually appropriate but also empathetically aligned with the customer’s emotional state—leading to improved customer satisfaction and more efficient issue resolution.

How to Get Started With MLLMs


Selecting an appropriate development framework is essential for working with MLLMs. It’s important to choose a framework that not only supports the modalities relevant to your project but also fits well with your existing technology stack and development practices.

NVIDIA NeMo™ is an end-to-end platform for developing custom generative AI. The NeMo framework provides a comprehensive library designed to facilitate the creation and fine-tuning of MLLMs across various data modalities.

  • NeVA (LLaVA): Provides training, fine-tuning, and inference capabilities.
  • VideoNeVA (LLaVA): Provides training and inference capabilities for the video modality.

The effectiveness of an MLLM heavily depends on the quality and alignment of the multimodal data it’s trained on. This involves collecting datasets that include aligned pairs or groups of different modalities, such as text-image pairs or video with captions. Proper preprocessing and normalization of this data is crucial to ensure that it can be effectively used to train the model.
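
As a rough sketch of what preparing aligned pairs can look like, the PyTorch dataset below loads image-caption pairs and applies simple preprocessing and normalization. The manifest format, file paths, tokenizer callable, and normalization values are assumptions chosen for illustration, not a prescribed pipeline.

```python
import json
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class CaptionedImageDataset(Dataset):
    """Loads (image, caption) pairs from a hypothetical JSON-lines manifest, e.g.
    {"image": "imgs/001.jpg", "caption": "a dog on a beach"} per line."""
    def __init__(self, manifest_path, tokenize_fn, image_size=224):
        self.records = [json.loads(line) for line in open(manifest_path)]
        self.tokenize = tokenize_fn  # any callable that maps a string to token IDs
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = self.transform(Image.open(rec["image"]).convert("RGB"))
        tokens = torch.tensor(self.tokenize(rec["caption"]))
        return image, tokens
```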

Leveraging a pretrained model can significantly reduce the need for extensive computational resources and provide a shortcut to achieving effective results. Fine-tuning this model on a specific dataset allows it to adapt to the particular characteristics and requirements of an application, enhancing its performance and relevance.
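
One common adaptation pattern when starting from pretrained components, used for example in LLaVA-style recipes, is to freeze the pretrained vision encoder and language backbone and train only a small projection layer that connects them. The sketch below is a minimal illustration; the modules here are stand-ins rather than real pretrained checkpoints.

```python
import torch
import torch.nn as nn

# Stand-ins for pretrained components; in practice these would be loaded from checkpoints.
vision_encoder = nn.Sequential(nn.Conv2d(3, 768, 16, 16), nn.Flatten(2))  # stand-in for a ViT
llm_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
)
projector = nn.Linear(768, 512)  # the small trainable bridge between modalities

# Freeze the pretrained parts; only the projector is updated during this stage.
for module in (vision_encoder, llm_backbone):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    patch_feats = vision_encoder(images).transpose(1, 2)  # (B, 196, 768)
visual_tokens = projector(patch_feats)                    # (B, 196, 512)
out = llm_backbone(visual_tokens)                         # frozen backbone consumes visual tokens
loss = out.pow(2).mean()                                  # placeholder loss for illustration
loss.backward()
optimizer.step()
```

Training only the projector keeps the compute and memory footprint small while still adapting the model to a new modality or dataset.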

Once the model is set up, it’s important to test it extensively with real-world data and scenarios. This testing phase is critical to understanding how well the model performs and identifying any areas where it may need further refinement. Continuous iteration based on performance feedback is key to developing a robust MLLM that reliably meets your objectives.

Deploying an MLLM involves integrating it into a suitable operational environment where it can receive inputs and generate outputs as required. Post-deployment, it’s important to monitor the model’s performance continuously and adjust its configuration as needed to maintain its effectiveness and efficiency.

Next Steps

Explore Our LLM Solutions

Find out how NVIDIA is helping to democratize large language models for enterprises through our LLM solutions.

Watch LLM Videos and Tutorials on Demand

This playlist of free large language model videos includes everything from tutorials and explainers to case studies and step-by-step guides.

Deepen Your Technical Knowledge of LLMs

Learn more about developing large language models on the NVIDIA Technical Blog.