Stream Processing

Stream processing is the continuous processing of new data events as they’re received.

 

What Is Stream Processing?

A stream is an unbounded sequence of events that go from producers  to consumers. A lot of data is produced as a stream of events, for example financial transactions, sensor measurements, or web server logs.

Stream of data from Producers to Consumers

Stream processing libraries, such as Streamz, help build pipelines to manage streams of continuous data, allowing applications to respond to events as they occur.

Streamz stream processing library pipeline

Stream processing pipelines often involve multiple actions such as filters, aggregations, counting, analytics, transformations, enrichment, branching, joining, flow control, feedback  into earlier stages, back pressure, and storage.

Why Stream Processing?

The continuous processing of data streams is useful in many applications such as:

  • Healthcare: continuous monitoring of instrument data
  • Smart cities: traffic patterns and congestion management
  • Manufacturing: optimization and predictive maintenance
  • Transportation: optimized routes and fuel consumption
  • Automobile: smart cars
  • Cyber security, anomaly detection: web or network log processing
  • Financial: stock market time series
  • Machine learning: real-time predictions
  • Advertising: location- or action-based advertising

The stream processing market is experiencing exponential growth with businesses relying heavily on real-time analytics, inferencing, monitoring, and more. Services built on streaming are now core components of daily business, and structured telemetry events and unstructured logs are growing at a rate of over 5X year-over-year. Big data streaming at this scale is becoming extremely complex and difficult to do efficiently in the modern business environment, where reliable, cost-effective streaming is paramount.

Accelerating Stream Processing with GPUs

NVIDIA RAPIDS cuStreamz is the first GPU-accelerated streaming data processing library, built with the goal to accelerate stream processing throughput and lower the total cost of ownership (TCO). Production cuStreamz pipelines at NVIDIA have saved hundreds of thousands of dollars a year. Written in Python, cuStreamz is built on top of RAPIDS, the GPU-accelerator for data science libraries. End-to-end GPU acceleration is quickly becoming the standard, as can be seen by Flink adding GPU support, and NVIDIA is excited to be part of this trend.

cuStreamz is built on:

  1. Streamz, an open-source Python library that helps build pipelines manage streams of continuous data;
  2. Dask, a robust and reliable scheduler to parallelize streaming workloads; and
  3. RAPIDS, a GPU-accelerated suite of libraries leveraged for streaming computations.

cuStreamz accelerates Streamz by leveraging RAPIDS cuDF under the hood to accelerate computations on streaming data iusing GPUs. cuStreamz also benefits from cuDF’s accelerated JSON, Parquet, and CSV readers and writers. The cuStreamz team has built an accelerated Kafka datasource connector to read data from Kafka really fast directly into cuDF dataframes, which delivers a considerable boost to end-to-end performance. Streaming pipelines can then be parallelized using Dask to run in a distributed mode for better performance at scale.

Dask Cluster

The cuStreamz architecture is summarized at a high level in the diagram below. cuStreamz is a bridge that connects Python streaming with GPUs and adds sophisticated and reliable streaming features like checkpointing and state management. cuStreamz provides the necessary building blocks to write streaming jobs that run reliably on GPUs with better performance at a lower cost.

cuStreamz architecture

GPU-Accelerated, End-to-End Data Science

The RAPIDS suite of open-source software libraries, built on NVIDIA® CUDA-X AI, provides the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

End-to-end data science and analytics pipelines entirely on GPUs

With the RAPIDS GPU DataFrame, data can be loaded onto GPUs using a Pandas-like interface, and then used for various connected machine learning and graph analytics algorithms without ever leaving the GPU. This level of interoperability is made possible through libraries like Apache Arrow. It allows acceleration for end-to-end pipelines—from data prep to machine learning to deep learning.

RAPIDS’s cuML machine learning algorithms and mathematical primitives follow the familiar scikit-learn-like API. Many popular algorithms like XGBoost are supported for both single-GPU and large data center deployments. For large datasets, these GPU-based implementations can complete 10-50X faster than their CPU equivalents.

RAPIDS supports device memory-sharing between many popular data science libraries. This keeps data on the GPU and avoids costly copying back and forth to host memory. 

Popular data science libraries.