GPU-Accelerated Apache Spark

For Data Analytics, Machine Learning, and Deep Learning Pipelines

GPU-accelerate your Apache Spark 3.0 data science pipelines—without code changes—and speed up data processing and model training while substantially lowering infrastructure costs.

Key Benefits of Spark on NVIDIA GPUs

Faster Execution Time

Accelerate the performance of data preparation tasks to quickly move to the next stage of the pipeline. This allows models to be trained faster, while freeing up data scientists and engineers to focus on the most critical activities.

Streamline Analytics to AI

Spark 3.0 orchestrates end-to-end pipelines, from data ingest to model training to visualization. The same GPU-accelerated infrastructure can be used for both Spark and ML/DL (deep learning) frameworks, eliminating the need for separate clusters and giving the entire pipeline access to GPU acceleration.

Reduced Infrastructure Costs

Do more with less: Spark on NVIDIA® GPUs completes jobs faster with less hardware when compared to CPUs, saving organizations time as well as on-premises capital costs or operational costs in the cloud.

Spark 3.0 Innovations

Given the “embarrassingly parallel” nature of many data processing tasks, it’s only natural that the architecture of a GPU should be leveraged for Spark data processing queries, similar to how a GPU accelerates DL workloads in AI. GPU acceleration is transparent to the developer and requires no code changes in order to obtain these benefits. Three key advancements in Spark 3.0 have contributed to delivering transparent GPU acceleration:

New RAPIDS Accelerator for Spark 3.0

NVIDIA CUDA® is a parallel computing architecture that accelerates computational operations on NVIDIA GPUs. RAPIDS, incubated at NVIDIA, is a suite of open-source libraries layered on top of CUDA that enables GPU acceleration of data science pipelines.

NVIDIA has created a RAPIDS Accelerator for Spark 3.0 that intercepts and accelerates ETL pipelines by dramatically improving the performance of Spark SQL and DataFrame operations.
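As a rough illustration of what "no code changes" can look like in practice, the accelerator is switched on through Spark configuration rather than through application code. The sketch below assembles a `spark-submit` command in plain Python. The `spark.plugins` key and the `com.nvidia.spark.SQLPlugin` class follow the RAPIDS Accelerator documentation; the jar path, the application file name, and the `build_submit_command` helper are placeholders introduced here, and a real deployment will likely need further settings.

```python
from typing import Dict, List

# Settings that load the RAPIDS Accelerator as a Spark plugin.
RAPIDS_CONF: Dict[str, str] = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}


def build_submit_command(app: str, jars: List[str], conf: Dict[str, str]) -> List[str]:
    """Return a spark-submit argument list; the application file is passed through unchanged."""
    cmd = ["spark-submit"]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    # Sort keys so the generated command is deterministic.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical jar and application paths, for illustration only.
    command = build_submit_command(
        "etl_job.py", ["/opt/rapids/rapids-4-spark.jar"], RAPIDS_CONF
    )
    print(" ".join(command))
```

The same `etl_job.py` could be submitted without these settings to run on CPUs, which is the sense in which acceleration is transparent to the developer.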

Modifications to Spark Components

Spark 3.0 adds columnar processing support to the Catalyst query optimizer, and the RAPIDS Accelerator plugs into this support to accelerate SQL and DataFrame operators. When the query plan is executed, those operators can then run on GPUs within the Spark cluster.
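To make the idea of columnar processing concrete, here is a minimal conceptual sketch in plain Python. It is not Spark or RAPIDS code, and the function names are invented for illustration. It pivots row-oriented records into one list per column and then applies an operator to whole columns at once, which is the data layout and access pattern that maps well onto a GPU.

```python
from typing import Dict, List

Row = Dict[str, float]
ColumnBatch = Dict[str, List[float]]


def to_columnar(rows: List[Row]) -> ColumnBatch:
    """Pivot row-oriented records into one list of values per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def add_columns(batch: ColumnBatch, left: str, right: str, out: str) -> ColumnBatch:
    """Apply a single operator across two whole columns, producing a new column."""
    result = dict(batch)
    result[out] = [a + b for a, b in zip(batch[left], batch[right])]
    return result


if __name__ == "__main__":
    rows = [{"price": 10.0, "tax": 1.0}, {"price": 20.0, "tax": 2.0}]
    print(add_columns(to_columnar(rows), "price", "tax", "total"))
```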

NVIDIA has also created a new Spark shuffle implementation that optimizes data transfer between Spark processes. This shuffle implementation is built on GPU-accelerated communication technologies, including UCX, RDMA, and NCCL.

GPU-Aware Scheduling in Spark

Spark 3.0 recognizes GPUs as a first-class resource along with CPU and system memory. This allows Spark 3.0 to place GPU-accelerated workloads directly onto servers containing the necessary GPU resources as they are needed to accelerate and complete a job.

NVIDIA engineers have contributed to this major Spark enhancement, enabling the launch of Spark applications on GPU resources in Spark standalone, YARN, and Kubernetes clusters.
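The scheduling behavior described above is driven by resource settings. The following sketch, in plain Python, builds the GPU-related configuration keys that Spark 3.0 documents (`spark.executor.resource.gpu.amount`, `spark.task.resource.gpu.amount`, and `spark.executor.resource.gpu.discoveryScript`) and estimates how many tasks can run concurrently on one executor. The helper names and the script path are assumptions made for illustration, and the slot calculation is a simplified model rather than Spark's actual scheduler logic.

```python
import math
from typing import Dict


def gpu_scheduling_conf(executor_gpus: int, task_gpus: float, discovery_script: str) -> Dict[str, str]:
    """Build the Spark 3.0 settings that request GPUs per executor and per task."""
    return {
        "spark.executor.resource.gpu.amount": str(executor_gpus),
        "spark.task.resource.gpu.amount": str(task_gpus),
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def concurrent_tasks(executor_cores: int, task_cpus: int, executor_gpus: int, task_gpus: float) -> int:
    """Estimate task slots per executor as the tighter of the CPU and GPU limits (simplified)."""
    cpu_slots = executor_cores // task_cpus
    if task_gpus < 1:
        # A fractional amount lets several tasks share one GPU.
        gpu_slots = executor_gpus * math.floor(1 / task_gpus)
    else:
        gpu_slots = executor_gpus // int(task_gpus)
    return min(cpu_slots, gpu_slots)


if __name__ == "__main__":
    # Hypothetical discovery script location, for illustration only.
    conf = gpu_scheduling_conf(1, 0.25, "/opt/spark/scripts/getGpusResources.sh")
    for key, value in conf.items():
        print(f"{key}={value}")
    print("task slots per executor:", concurrent_tasks(8, 1, 1, 0.25))
```

Under this model, an executor with 8 cores and 1 GPU where each task requests a quarter of a GPU is limited to 4 concurrent tasks by the GPU rather than by its cores.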

Accelerated Analytics and AI on Spark

Spark 3.0 marks a key milestone for analytics and AI, as ETL operations are now accelerated while ML and DL applications leverage the same GPU infrastructure. The complete stack for this accelerated data science pipeline is shown below:

[Figure: Accelerated Analytics and AI on Spark]

Getting Started with GPU-Accelerated Spark

If you’re interested in early access to the RAPIDS Accelerator for the preview release of Apache Spark 3.0, please contact the Spark team at NVIDIA.

Adobe

We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs. With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.

- William Yan, Senior Director of Machine Learning, Adobe

Databricks

Our continued work with NVIDIA improves performance with RAPIDS optimizations for Apache Spark 3.0 and Databricks to benefit our joint customers like Adobe. These contributions lead to faster data pipelines, model training and scoring, that directly translate to more breakthroughs and insights for our community of data engineers and data scientists.

- Matei Zaharia, original creator of Apache Spark and Chief Technologist at Databricks

Cisco

Cisco has thousands of customers with big data deployments for their data lake who are constantly looking to accelerate their workloads. Apache Spark 3.0 brings newer capabilities to access NVIDIA GPUs natively, thereby defining the next generation of data lakes accelerating AI/ML, ETL, and other workloads. Cisco is working closely with NVIDIA to bring this next phase of data lake innovation to our customers.

- Siva Sivakumar, Senior Director Data Center Solutions, Cisco

Download Our Free eBook

Are you looking to unlock the value of big data with the power of AI? Download our new eBook, “Accelerating Apache Spark 3.x – Leveraging NVIDIA GPUs to Power the Next Era of Analytics and AI,” to learn more about the next evolution in Apache Spark.