GPU-Accelerated Apache Spark

For Data Analytics, Machine Learning, and Deep Learning Pipelines

GPU-accelerate your Apache Spark 3 data science pipelines—without code changes—and speed up data processing and model training while substantially lowering infrastructure costs.

 

Key Benefits of Spark on NVIDIA GPUs

Faster Execution Time

Accelerate the performance of data preparation tasks to quickly move to the next stage of the pipeline. This allows models to be trained faster, while freeing up data scientists and engineers to focus on the most critical activities.

Streamline Analytics to AI

Spark 3 orchestrates end-to-end pipelines, from data ingest to model training to visualization. The same GPU-accelerated infrastructure can be used for both Spark and ML/DL (deep learning) frameworks, eliminating the need for separate clusters and giving the entire pipeline access to GPU acceleration.

Reduced Infrastructure Costs

Do more with less: Spark on NVIDIA® GPUs completes jobs faster with less hardware when compared to CPUs, saving organizations time as well as on-premises capital costs or operational costs in the cloud.

Spark 3 Innovations

Given the “embarrassingly parallel” nature of many data processing tasks, it’s only natural that the architecture of a GPU should be leveraged for Spark data processing queries, similar to how a GPU accelerates DL workloads in AI. GPU acceleration is transparent to the developer and requires no code changes in order to obtain these benefits. Three key advancements in Spark 3 have contributed to delivering transparent GPU acceleration:

New RAPIDS Accelerator for Spark 3

NVIDIA CUDA® is a revolutionary parallel computing architecture that supports accelerating computational operations on the NVIDIA GPU architecture. RAPIDS, incubated at NVIDIA, is a suite of open-source libraries layered on top of CUDA that enables GPU acceleration of data science pipelines.

NVIDIA has created a RAPIDS Accelerator for Spark 3 that intercepts and accelerates ETL pipelines by dramatically improving the performance of Spark SQL and DataFrame operations.
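As a rough illustration of what "no code changes" means in practice, the accelerator is switched on through Spark configuration rather than through edits to the job itself. The sketch below only assembles a spark-submit command line. The jar path and application file name are placeholders, and the helper function is hypothetical. The two configuration keys are the ones commonly documented for the plugin, but check them against the release you deploy.

```python
# Sketch: build a spark-submit argument list that loads the RAPIDS Accelerator.
# The jar path and application name passed in below are placeholders.

def rapids_submit_args(jar_path, app, extra_conf=None):
    """Return spark-submit arguments that enable the RAPIDS plugin (hypothetical helper)."""
    conf = {
        # Load the accelerator through Spark's plugin mechanism.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow SQL and DataFrame operators to be placed on the GPU.
        "spark.rapids.sql.enabled": "true",
    }
    conf.update(extra_conf or {})

    args = ["spark-submit", "--jars", jar_path]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# The application file itself is unchanged; only the launch command differs.
submit_args = rapids_submit_args("rapids-4-spark.jar", "etl_job.py")
print(" ".join(submit_args))
```

Setting `spark.rapids.sql.enabled` to `false` in `extra_conf` would produce a command that runs the same job on CPUs, which can be handy for a side-by-side comparison.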

Modifications to Spark Components

Spark 3 provides columnar processing support in the Catalyst query optimizer, which the RAPIDS Accelerator plugs into to accelerate SQL and DataFrame operators. When the query plan is executed, those operators can then run on GPUs within the Spark cluster.
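The idea of plugging into the query plan can be pictured with a toy model: operators that have a GPU implementation are placed on the GPU, and anything else stays on the CPU. This is a simplified sketch, not the accelerator's actual logic. The operator names and the supported set below are made up for illustration.

```python
# Toy model of operator placement in a query plan.
# GPU_SUPPORTED is a hypothetical set, not the accelerator's real support matrix.

GPU_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


def place_operators(plan):
    """Pair each operator in a linear plan with the device it would run on."""
    return [(op, "GPU" if op in GPU_SUPPORTED else "CPU") for op in plan]


# "CustomUDF" stands in for any operator without a GPU version.
plan = ["Scan", "Filter", "CustomUDF", "Project", "HashAggregate"]
for op, device in place_operators(plan):
    print(f"{op:15s} -> {device}")
```

In this picture the query still completes when an operator is unsupported. That operator simply runs on the CPU while the rest of the plan uses the GPU.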

NVIDIA has also created a new Spark shuffle implementation that optimizes the data transfer between Spark processes. This shuffle implementation is built upon GPU-accelerated communication libraries and technologies, including UCX, RDMA, and NCCL.

GPU-Aware Scheduling in Spark

Spark 3 recognizes GPUs as a first-class resource along with CPU and system memory. This allows Spark 3 to place GPU-accelerated workloads directly onto servers containing the necessary GPU resources as they are needed to accelerate and complete a job.

NVIDIA engineers have contributed to this major Spark enhancement, enabling the launch of Spark applications on GPU resources in Spark standalone, YARN, and Kubernetes clusters.
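To make the scheduling idea concrete, the sketch below estimates how many tasks one executor can run at once when both CPU cores and GPUs are scheduled resources. It is a simplified model of the slot arithmetic, not Spark's scheduler code. It assumes the standard settings `spark.executor.cores`, `spark.task.cpus`, `spark.executor.resource.gpu.amount`, and `spark.task.resource.gpu.amount`, and it assumes a per-task GPU amount below 1 means several tasks share one GPU.

```python
import math
from fractions import Fraction


def concurrent_tasks(executor_cores, task_cpus, executor_gpus, task_gpu_amount):
    """Estimate task slots per executor as the minimum over CPU and GPU limits."""
    cpu_slots = executor_cores // task_cpus

    # Parse via str() so values such as 0.1 are treated as exact decimals.
    amount = Fraction(str(task_gpu_amount))
    if amount < 1:
        # Fractional amount: several tasks share each GPU.
        gpu_slots = executor_gpus * math.floor(1 / amount)
    else:
        # Whole GPUs per task.
        gpu_slots = executor_gpus // int(amount)

    return min(cpu_slots, gpu_slots)


# 8 cores, 1 GPU, each task asks for a quarter of a GPU: the GPU is the limit.
print(concurrent_tasks(executor_cores=8, task_cpus=1, executor_gpus=1, task_gpu_amount=0.25))  # 4
# 16 cores, 2 GPUs, each task asks for a whole GPU.
print(concurrent_tasks(executor_cores=16, task_cpus=1, executor_gpus=2, task_gpu_amount=1))  # 2
```

Under this model, lowering the per-task GPU amount raises concurrency only until the CPU core count becomes the limiting resource.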

Accelerated Analytics and AI on Spark

Spark 3 marks a key milestone for analytics and AI, as ETL operations are now accelerated while ML and DL applications leverage the same GPU infrastructure. The complete stack for this accelerated data science pipeline is shown below:

[Figure: Accelerated Analytics and AI on Spark]

GET STARTED WITH GPU-ACCELERATED SPARK

Download the RAPIDS Accelerator for Spark 3 to GPU-accelerate your Apache Spark data science pipelines. Customers can also contact the NVIDIA Spark team on GitHub.

IRS

The Cloudera and NVIDIA integration will empower us to use data-driven insights to power mission-critical use cases… we are currently implementing this integration, and already seeing over 10x speed improvements at half the cost for our data engineering and data science workflows.

- Joe Ansaldi, IRS/Research Applied Analytics & Statistics Division (RAAS)/Technical Branch Chief

Adobe

We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3 compared to running Spark on CPUs. With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.

- William Yan, Senior Director of Machine Learning, Adobe

Databricks

Our continued work with NVIDIA improves performance with RAPIDS optimizations for Apache Spark 3 and Databricks to benefit our joint customers like Adobe. These contributions lead to faster data pipelines, model training and scoring, that directly translate to more breakthroughs and insights for our community of data engineers and data scientists.

- Matei Zaharia, original creator of Apache Spark and Chief Technologist at Databricks

Download Our Free eBook

Are you looking to unlock the value of big data with the power of AI? Download our new eBook, “Accelerating Apache Spark 3.x – Leveraging NVIDIA GPUs to Power the Next Era of Analytics and AI,” to learn more about the next evolution in Apache Spark.