Speed Up Deep Learning Training
With GPU-Accelerated Caffe
The fastest, easiest way to get started with Caffe on GPUs

Training Models

AlexNet (256 Batch Size)

By default, the model is set up to fully train the network, which could take anywhere from several hours to days. For the purpose of benchmarking, we'll limit the number of iterations to 1000. Open the file models/bvlc_alexnet/solver.prototxt in a text editor and make the following change:

max_iter: 1000
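
If you prefer to script the edit instead of opening an editor, a minimal sketch using GNU sed (assuming max_iter appears only once in the stock solver file) is:

$ sed -i 's/^max_iter:.*/max_iter: 1000/' models/bvlc_alexnet/solver.prototxt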

Save and close the file. You can now train the network:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH

$ ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu 0

……….
I0817 13:29:57.535207 30840 solver.cpp:242] Iteration 160 (1.57876 iter/s, 12.6682s/20 iter), loss = 6.90907
I0817 13:29:57.535292 30840 solver.cpp:261] Train net output #0: loss = 6.90907 (* 1 = 6.90907 loss)
I0817 13:29:57.535312 30840 sgd_solver.cpp:106] Iteration 160, lr = 0.01
I0817 13:30:10.195734 30840 solver.cpp:242] Iteration 180 (1.57974 iter/s, 12.6603s/20 iter), loss = 6.90196
I0817 13:30:10.195816 30840 solver.cpp:261] Train net output #0: loss = 6.90196 (* 1 = 6.90196 loss)
I0817 13:30:10.195835 30840 sgd_solver.cpp:106] Iteration 180, lr = 0.01
I0817 13:30:22.852818 30840 solver.cpp:242] Iteration 200 (1.58017 iter/s, 12.6568s/20 iter), loss = 6.92144
……….
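
The iter/s figures in the log translate directly into throughput: at a batch size of 256, roughly 1.58 iter/s works out to about 1.58 × 256 ≈ 404 images/s. If you capture the console output in a file (train_alexnet.log below is only an example name, e.g. by appending 2>&1 | tee train_alexnet.log to the training command), you can average the reported rates with a quick one-liner:

$ grep -o '[0-9.]* iter/s' train_alexnet.log | awk '{ sum += $1; n++ } END { if (n) printf "%.2f iter/s avg (~%.0f images/s at batch 256)\n", sum/n, sum/n*256 }'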

You can train on multiple GPUs by passing a comma-separated list of device IDs to --gpu, or "--gpu all" to use all available GPUs in the system.
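
For example, a run on the first four GPUs of the system (assuming at least four devices are installed) would look like:

$ ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu 0,1,2,3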

GoogLeNet (32 Batch Size)

By default, the model is set up to fully train the network, which could take anywhere from several hours to days. For the purpose of benchmarking, we'll limit the number of iterations to 1000. Open the file models/bvlc_googlenet/solver.prototxt in a text editor and make the following change:

max_iter: 1000
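
To confirm the new iteration limit and the 32-image training batch size referenced in the heading, a quick check (assuming the stock bvlc_googlenet file layout, where the batch size is set in train_val.prototxt) is:

$ grep max_iter models/bvlc_googlenet/solver.prototxt

$ grep batch_size models/bvlc_googlenet/train_val.prototxt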

Save and close the file. You can now train the network:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH

$ ./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt --gpu 0

……….
I0817 13:33:08.056823 30959 solver.cpp:242] Iteration 80 (7.96223 iter/s, 5.02372s/40 iter), loss = 11.1401
I0817 13:33:08.056893 30959 solver.cpp:261] Train net output #0: loss1/loss1 = 6.85843 (* 0.3 = 2.05753 loss)
I0817 13:33:08.056910 30959 solver.cpp:261] Train net output #1: loss2/loss1 = 7.00557 (* 0.3 = 2.10167 loss)
I0817 13:33:08.056921 30959 solver.cpp:261] Train net output #2: loss3/loss3 = 6.82249 (* 1 = 6.82249 loss)
I0817 13:33:08.056934 30959 sgd_solver.cpp:106] Iteration 80, lr = 0.01
I0817 13:33:13.074957 30959 solver.cpp:242] Iteration 120 (7.97133 iter/s, 5.01798s/40 iter), loss = 11.1306
I0817 13:33:13.075026 30959 solver.cpp:261] Train net output #0: loss1/loss1 = 6.91996 (* 0.3 = 2.07599 loss)
I0817 13:33:13.075042 30959 solver.cpp:261] Train net output #1: loss2/loss1 = 6.91151 (* 0.3 = 2.07345 loss)
I0817 13:33:13.075052 30959 solver.cpp:261] Train net output #2: loss3/loss3 = 6.95206 (* 1 = 6.95206 loss)
I0817 13:33:13.075065 30959 sgd_solver.cpp:106] Iteration 120, lr = 0.01
I0817 13:33:18.099795 30959 solver.cpp:242] Iteration 160 (7.96068 iter/s, 5.0247s/40 iter), loss = 11.1211
……….

You can train on multiple GPUs by passing a comma-separated list of device IDs (e.g. "--gpu 0,1,2,3") or "--gpu all" to use all available GPUs in the system.
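
For instance, to spread this run across every GPU in the machine:

$ ./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt --gpu all

Keep in mind that Caffe's multi-GPU training is data-parallel, so the effective batch size scales with the number of devices (with four GPUs, the 32-image batch here becomes an effective batch of 128).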

See How The Performance Stacks Up

View Benchmarks