GPU-Accelerated Caffe

Get started today with this GPU Ready Apps Guide.

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. This popular computer vision framework is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors. Caffe powers academic research projects, startup prototypes, and large-scale industrial applications in vision, speech, and multimedia.

Caffe runs up to 65% faster on the latest NVIDIA Pascal GPUs and scales across multiple GPUs within a single node. Now you can train models in hours instead of days.

Installation

System Requirements

The GPU-enabled version of Caffe requires an NVIDIA GPU with a recent driver, the CUDA Toolkit, and the cuDNN library; the tested versions are listed in the Recommended System Configurations section at the end of this guide.

Download and Installation Instructions

1. Install CUDA

To use Caffe with NVIDIA GPUs, the first step is to install the CUDA Toolkit.
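On Ubuntu 14.04, one common route is NVIDIA's apt repository. A sketch only: the .deb package name below is illustrative of the CUDA 8.0 releases, and the current file should be taken from the CUDA downloads page.

```shell
# Add NVIDIA's CUDA apt repository and install the toolkit.
# The package name is illustrative; download the current .deb
# for your distribution from the CUDA downloads page.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
```

Afterwards, add /usr/local/cuda/bin to your PATH so that nvcc can be found.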

2. Install cuDNN

Once the CUDA Toolkit is installed, download cuDNN v5.1 Library for Linux (note that you'll need to register for the Accelerated Computing Developer Program).

Once downloaded, extract the archive into /usr/local. The tarball contains a top-level cuda/ directory, so its files land inside the CUDA Toolkit directory (assumed here to be /usr/local/cuda/):

$ sudo tar -xvf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local
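To confirm which cuDNN version the installed headers report, you can read the version macros out of cudnn.h. A sketch: the stand-in header written below is for illustration only; on a real install, point the function at /usr/local/cuda/include/cudnn.h.

```shell
# Report the cuDNN version recorded in a cudnn.h header (path as argument).
cudnn_version() {
  awk '/^#define CUDNN_MAJOR/      {ma=$3}
       /^#define CUDNN_MINOR/      {mi=$3}
       /^#define CUDNN_PATCHLEVEL/ {p=$3}
       END { print ma "." mi "." p }' "$1"
}

# Stand-in header with illustrative values; use the real cudnn.h on your system:
printf '#define CUDNN_MAJOR 5\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 10\n' > /tmp/cudnn.h
cudnn_version /tmp/cudnn.h    # prints 5.1.10
```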

3. Install Dependencies

Caffe depends on several libraries that should be available from your system's package manager.

On Ubuntu 14.04, the following commands will install the necessary libraries:

$ sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev git

$ sudo apt-get install --no-install-recommends libboost-all-dev

4. Install NCCL

NVIDIA NCCL is required to run Caffe on more than one GPU. NCCL can be installed with the following commands:

$ git clone https://github.com/NVIDIA/nccl.git

$ cd nccl

$ sudo make install -j4

NCCL libraries and headers will be installed in /usr/local/lib and /usr/local/include.

5. Install Caffe

We recommend installing the latest released version of Caffe from NVIDIA's fork, found at https://github.com/NVIDIA/caffe/releases. As of this writing, the latest released version is 0.15.9.

$ wget https://github.com/NVIDIA/caffe/archive/v0.15.9.tar.gz

$ tar -zxf v0.15.9.tar.gz

$ cd caffe-0.15.9

$ cp Makefile.config.example Makefile.config

Open the newly created Makefile.config in a text editor and make the following changes:

Uncomment the line USE_CUDNN := 1. This enables cuDNN acceleration.

Uncomment the line USE_NCCL := 1. This enables NCCL which is required to run Caffe on multiple GPUs.

Save and close the file. You're now ready to compile Caffe.
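If you prefer to script these edits, both flags can be uncommented with sed. The sketch below operates on a stand-in file so it is safe to try anywhere; run the same sed command against the real Makefile.config.

```shell
# Stand-in for the two commented-out lines in Makefile.config:
printf '# USE_CUDNN := 1\n# USE_NCCL := 1\n' > /tmp/Makefile.config

# Strip the leading '# ' from both flags (GNU sed in-place edit):
sed -i 's/^# *\(USE_CUDNN := 1\)/\1/; s/^# *\(USE_NCCL := 1\)/\1/' /tmp/Makefile.config
cat /tmp/Makefile.config    # both flags are now active
```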

$ make all -j4

When this command completes, the Caffe binary will be available at build/tools/caffe.

PREPARE AN IMAGE DATABASE

A database of images is required as input to test Caffe's training performance. Caffe ships with models that are set up to use images from the ILSVRC12 challenge ("ImageNet"). The original image files can be downloaded from http://image-net.org/download-images (you'll need to create an account and agree to the terms of use). Once you've downloaded and unpacked the original image files onto your system, continue with the steps below. It's assumed that the original images are stored on disk as follows:

/path/to/imagenet/train/n01440764/n01440764_10026.JPEG

/path/to/imagenet/val/ILSVRC2012_val_00000001.JPEG
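As a sanity check on this layout, you can count the image files per split; the full ILSVRC12 set should contain 1,281,167 training and 50,000 validation images. `count_images` is a small hypothetical helper, not part of Caffe:

```shell
# Count the .JPEG files under the train/ and val/ subdirectories
# of an ImageNet root directory (passed as the first argument).
count_images() {
  echo "train: $(find "$1/train" -name '*.JPEG' | wc -l)"
  echo "val: $(find "$1/val" -name '*.JPEG' | wc -l)"
}
# e.g. count_images /path/to/imagenet
```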

6. Download Auxiliary Data

$ ./data/ilsvrc12/get_ilsvrc_aux.sh

7. Create the Database

Open the file examples/imagenet/create_imagenet.sh in a text editor and make the following changes:

Change the variables TRAIN_DATA_ROOT and VAL_DATA_ROOT to the path where you unpacked the original images.

Set RESIZE=true so the images will be resized properly before being added to the database.
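These two edits can also be scripted with sed. A sketch on a stand-in file: /data/imagenet is an example path (substitute your own), and the same commands can be run against examples/imagenet/create_imagenet.sh.

```shell
# Stand-in for the three relevant lines of create_imagenet.sh:
printf 'TRAIN_DATA_ROOT=/path/to/imagenet/train/\nVAL_DATA_ROOT=/path/to/imagenet/val/\nRESIZE=false\n' > /tmp/create_imagenet.sh

# Point the data roots at the unpacked images and enable resizing:
sed -i -e 's|^TRAIN_DATA_ROOT=.*|TRAIN_DATA_ROOT=/data/imagenet/train/|' \
       -e 's|^VAL_DATA_ROOT=.*|VAL_DATA_ROOT=/data/imagenet/val/|' \
       -e 's|^RESIZE=.*|RESIZE=true|' /tmp/create_imagenet.sh
cat /tmp/create_imagenet.sh
```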

Save and close the file. You're now ready to create the image databases with the following command:

$ ./examples/imagenet/create_imagenet.sh

Then, create the required image mean file with:

$ ./examples/imagenet/make_imagenet_mean.sh

Training Models

ALEXNET (256 BATCH SIZE)

By default, the model is set up to fully train the network, which could take anywhere from several hours to days. For the purpose of benchmarking, we'll limit the number of iterations to 1000. Open the file models/bvlc_alexnet/solver.prototxt in a text editor and make the following change:

max_iter: 1000

Save and close the file. You can now train the network:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH

$ ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu 0

……….
I0817 13:29:57.535207 30840 solver.cpp:242] Iteration 160 (1.57876 iter/s, 12.6682s/20 iter), loss = 6.90907
I0817 13:29:57.535292 30840 solver.cpp:261] Train net output #0: loss = 6.90907 (* 1 = 6.90907 loss)
I0817 13:29:57.535312 30840 sgd_solver.cpp:106] Iteration 160, lr = 0.01
I0817 13:30:10.195734 30840 solver.cpp:242] Iteration 180 (1.57974 iter/s, 12.6603s/20 iter), loss = 6.90196
I0817 13:30:10.195816 30840 solver.cpp:261] Train net output #0: loss = 6.90196 (* 1 = 6.90196 loss)
I0817 13:30:10.195835 30840 sgd_solver.cpp:106] Iteration 180, lr = 0.01
I0817 13:30:22.852818 30840 solver.cpp:242] Iteration 200 (1.58017 iter/s, 12.6568s/20 iter), loss = 6.92144
……….
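The iter/s figure in the log converts to images per second when multiplied by the batch size (256 for this AlexNet configuration). A small sketch using one of the log lines above:

```shell
# Extract the iter/s value from a solver log line and scale by batch size.
echo 'Iteration 160 (1.57876 iter/s, 12.6682s/20 iter), loss = 6.90907' \
  | sed -n 's/.*(\([0-9.]*\) iter\/s.*/\1/p' \
  | awk '{ printf "%.0f images/s\n", $1 * 256 }'
# prints: 404 images/s
```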

You can train on multiple GPUs by specifying more device IDs (e.g. --gpu 0,1,2,3) or --gpu all to use all available GPUs in the system. Note that each GPU processes the batch size given in the model definition, so the effective batch size scales with the number of GPUs.

GOOGLENET (32 BATCH SIZE)

By default, the model is set up to fully train the network, which could take anywhere from several hours to days. For the purpose of benchmarking, we'll limit the number of iterations to 1000. Open the file models/bvlc_googlenet/solver.prototxt in a text editor and make the following change:

max_iter: 1000

Save and close the file. You can now train the network:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH

$ ./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt --gpu 0

……….
I0817 13:33:08.056823 30959 solver.cpp:242] Iteration 80 (7.96223 iter/s, 5.02372s/40 iter), loss = 11.1401
I0817 13:33:08.056893 30959 solver.cpp:261] Train net output #0: loss1/loss1 = 6.85843 (* 0.3 = 2.05753 loss)
I0817 13:33:08.056910 30959 solver.cpp:261] Train net output #1: loss2/loss1 = 7.00557 (* 0.3 = 2.10167 loss)
I0817 13:33:08.056921 30959 solver.cpp:261] Train net output #2: loss3/loss3 = 6.82249 (* 1 = 6.82249 loss)
I0817 13:33:08.056934 30959 sgd_solver.cpp:106] Iteration 80, lr = 0.01
I0817 13:33:13.074957 30959 solver.cpp:242] Iteration 120 (7.97133 iter/s, 5.01798s/40 iter), loss = 11.1306
I0817 13:33:13.075026 30959 solver.cpp:261] Train net output #0: loss1/loss1 = 6.91996 (* 0.3 = 2.07599 loss)
I0817 13:33:13.075042 30959 solver.cpp:261] Train net output #1: loss2/loss1 = 6.91151 (* 0.3 = 2.07345 loss)
I0817 13:33:13.075052 30959 solver.cpp:261] Train net output #2: loss3/loss3 = 6.95206 (* 1 = 6.95206 loss)
I0817 13:33:13.075065 30959 sgd_solver.cpp:106] Iteration 120, lr = 0.01
I0817 13:33:18.099795 30959 solver.cpp:242] Iteration 160 (7.96068 iter/s, 5.0247s/40 iter), loss = 11.1211
……….

You can train on multiple GPUs by specifying more device IDs (e.g. --gpu 0,1,2,3) or --gpu all to use all available GPUs in the system.

Benchmarks

IMAGE TRAINING PERFORMANCE ON GOOGLENET

GoogLeNet is a newer deep learning model that uses a deeper, wider network to deliver higher image-classification accuracy.

[Chart: Image Classification Training Performance, Caffe performance on multiple GPUs per node]

Recommended System Configurations

Hardware Configuration

PC

  Parameter         Specs
  CPU Architecture  x86_64
  System Memory     8-32 GB
  CPUs              1
  GPU Model         NVIDIA® TITAN X
  GPUs              1-2

Servers

  Parameter         Specs
  CPU Architecture  x86_64
  System Memory     32 GB
  CPUs/Node         1-2
  GPU Model         Tesla® M40, Tesla® P100
  GPUs/Node         1-4

Software Configuration

Software stack

  Parameter       Version
  OS              Ubuntu 14.04
  GPU Driver      367.27 or newer
  CUDA Toolkit    8.0
  cuDNN Library   v5.1

Build Your Ideal GPU Solution Today.