Computer Vision

Computer vision is a field in which computers and complex algorithms are used to interpret and understand digital images and videos.

What is Computer Vision?

As a subfield of AI and deep learning, computer vision trains convolutional neural networks (CNNs) to develop human-like vision capabilities for applications. The primary goal of computer vision is first to understand the content of videos and still images, then to derive useful information from them to solve an ever-widening array of problems. This can include training CNNs for segmentation, classification, and detection using images and videos as data.

Figure: Input image, convolutional layers, fully connected layer, and output class.
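To make the pictured pipeline concrete, here is a minimal, dependency-free Python sketch of a CNN forward pass: a tiny grayscale "image" flows through one convolutional layer, a ReLU activation, flattening, and a fully connected layer that produces class scores. All values here (the image, the kernel, and the weights) are made up for illustration; in a real CNN they are learned from data.

```python
# Minimal sketch of the pictured CNN pipeline. Weights are illustrative,
# not learned parameters.

def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as most DL libraries use)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Elementwise activation: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]

def fully_connected(features, weights, biases):
    """One linear layer: each output class is a weighted sum of the features."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

# A 4x4 input "image" (dark left half, bright right half) and a 2x2 filter
# that responds to a dark-to-bright vertical edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1], [-1, 1]]

feature_map = relu(conv2d(image, kernel))           # 3x3 feature map
features = [v for row in feature_map for v in row]  # flatten to a vector
# Two classes with hand-picked weights, purely for illustration.
weights = [[0.1] * 9, [-0.1] * 9]
scores = fully_connected(features, weights, [0.0, 0.5])
predicted_class = max(range(len(scores)), key=lambda c: scores[c])
```

In a trained network, backpropagation would adjust `kernel` and `weights` so the class scores match labeled examples; the forward pass itself has exactly this shape.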

Why Computer Vision?

Computer vision has countless applications, including sports, automotive, agriculture, retail, banking, construction, insurance, and beyond. AI-driven machines of all types are gaining eyes like ours, thanks to convolutional neural networks (CNNs), the image crunchers now used by machines to identify objects. CNNs are today's eyes of autonomous vehicles, oil exploration, and fusion energy research. They can also help spot diseases faster in medical imaging and save lives.

Convolutional neural networks can perform segmentation, classification, and detection for a myriad of applications:

  • Segmentation: Image segmentation classifies each pixel as belonging to a particular category, such as a car, road, or pedestrian. It's widely used in self-driving vehicle applications, including the NVIDIA DRIVE software stack, to show roads, cars, and people. Think of it as a visualization technique that makes what computers do easier for humans to understand.
  • Classification: Image classification is used to determine what’s in an image. Neural networks can be trained to identify dogs or cats, for example, or many other things with a high degree of precision, given sufficient data.
  • Detection: Image detection allows computers to localize where objects exist. It draws rectangular bounding boxes that fully contain each object. A detector might be trained to see where cars or people are within an image, for instance.
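One way to make "localize where objects exist" precise: detectors are commonly scored by intersection over union (IoU), the overlap between a predicted box and a ground-truth box divided by their combined area. A small sketch, with made-up box coordinates:

```python
# Detection outputs are rectangular bounding boxes. Intersection over union
# (IoU) compares a predicted box with a ground-truth box: overlap area
# divided by combined area. Boxes use (x1, y1, x2, y2) corner coordinates;
# the values below are invented for illustration.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero-sized if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

predicted = (10, 10, 50, 50)   # hypothetical detector output
truth = (20, 20, 60, 60)       # hypothetical labeled box
overlap = iou(predicted, truth)
```

A detection is typically counted as correct when IoU exceeds a chosen threshold, often 0.5.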

Segmentation, Classification, and Detection

  • Segmentation: good at delineating objects; used in self-driving vehicles.
  • Classification: is it a cat or a dog? Classifies with precision.
  • Detection: where does an object exist in space? Recognizes things for safety.
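As a toy illustration of per-pixel labeling, the essence of segmentation, the sketch below assigns each pixel of a tiny grayscale scene to a class by brightness band. Real systems learn these decisions with a CNN; the class names and thresholds here are invented for illustration.

```python
# Segmentation assigns a class label to every pixel. This toy version labels
# pixels by brightness band with fixed rules; a real segmentation network
# learns the decision per pixel from labeled data.

CLASSES = {0: "road", 1: "car", 2: "sky"}  # hypothetical class set

def label_pixel(intensity):
    if intensity < 85:
        return 0   # dark  -> "road"
    if intensity < 170:
        return 1   # mid   -> "car"
    return 2       # light -> "sky"

def segment(image):
    """Return a per-pixel label map with the same shape as the input image."""
    return [[label_pixel(px) for px in row] for row in image]

scene = [
    [200, 210, 220],   # bright band
    [100, 120, 140],   # mid band
    [30, 40, 50],      # dark band
]
label_map = segment(scene)
```

The output label map has one class per pixel, which is exactly what a segmentation network produces, just with learned rather than hand-written decisions.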

Computer vision systems are far better than humans at classifying images and videos into finely discrete categories and classes, such as minute changes over time in medical computerized axial tomography (CAT) scans. In this sense, computer vision automates tasks that humans could potentially do, but with far greater accuracy and speed.

With the wide range of current and potential applications, it isn't surprising that growth projections for computer vision technologies and solutions are prodigious. One market research survey projects that this market will grow a stunning 47% annually through 2023, when it will reach $25 billion globally. In all of computer science, computer vision stands among the hottest and most active areas of research and development.

Many of these applications of AI are made possible by decades of advances in deep neural networks and by strides in high-performance computing on GPUs able to process massive amounts of data.

How Does Computer Vision Work?

Computer vision analyzes images and then creates numerical representations of what it 'sees' using a convolutional neural network (CNN). A CNN is a class of artificial neural network that uses convolutional layers to filter inputs for useful information. The convolution operation combines the input data (feature map) with a convolution kernel (filter) to form a transformed feature map. The filters in the convolutional layers (conv layers) are modified based on learned parameters to extract the most useful information for a specific task. Convolutional networks adjust automatically to find the best features for the task at hand: a CNN might emphasize the shape of an object for a general object recognition task but extract color information for a bird recognition task, reflecting that different classes of objects tend to differ in shape, while different types of birds are more likely to differ in color than in shape.
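The idea that different filters extract different information can be shown with two classic hand-crafted kernels applied to the same tiny image: a Sobel-style kernel that responds to a vertical edge (shape) and an averaging kernel that summarizes local brightness. In a CNN the kernel values are learned rather than hand-written:

```python
# The convolution operation slides a small kernel (filter) across the input
# feature map, producing a transformed feature map. Different kernels pull
# different information out of the same image.

def convolve(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as deep learning uses)."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# A 5x5 image: dark left half, bright right half (a vertical edge).
image = [[0, 0, 0, 9, 9] for _ in range(5)]

# Sobel-style vertical-edge kernel: fires where intensity changes left to right.
edge_kernel = [[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]]

# Averaging (box-blur) kernel: summarizes local brightness instead.
blur_kernel = [[1 / 9] * 3 for _ in range(3)]

edges = convolve(image, edge_kernel)     # large values only at the edge
blurred = convolve(image, blur_kernel)   # smooth brightness gradient
```

The edge map is strongly activated only where the intensity changes, while the blur map varies gently; a trained CNN discovers kernels like these on its own when they help the task.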


Industry Use Cases for Computer Vision

Use cases of computer vision include image recognition, image classification, video labeling, robotics, virtual assistants, and self-driving cars. Prominent examples include:

  • Medicine. Medical image processing involves the speedy extraction of vital image data to help properly diagnose a patient, including rapid detection of tumors and hardening of the arteries. While computer vision cannot by itself be trusted to provide diagnoses, it is an invaluable part of modern medical diagnostic techniques, at minimum reinforcing physicians' assessments and, increasingly, surfacing information physicians would not otherwise have seen.
  • Military. Advanced computer vision systems can direct missiles and other armaments to specific targeted areas rather than to specific targets. Once the missile closes on the area, the final destination is determined by local images acquired and instantly analyzed by the computer vision system.
  • Autonomous vehicles. Another very active area of computer vision research, autonomous vehicles can be controlled entirely by computer vision solutions or have their operation significantly enhanced by them. Common applications already at work include early warning systems in cars.
  • Industrial uses. Manufacturing abounds with current and potential uses of computer vision in support of manufacturing processes. Current uses include quality control, wherein computer vision systems inspect parts and finished products for defects. In agriculture, computer vision systems use optical sorting to remove unwanted materials from food products.

Data Scientists and Computer Vision
Python is the most popular programming language for machine learning (ML), and most data scientists are familiar with its ease of use and its large store of libraries, most of them free and open source. Data scientists use Python in ML systems for data mining and data analysis, as Python provides support for a broad range of ML models and algorithms. Given the relationship between ML and computer vision, data scientists can apply the expanding universe of computer vision applications to businesses of all types, extracting vital information from stores of images and videos and augmenting data-driven decision-making.

Accelerating Convolutional Neural Networks using GPUs

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

Figure: The difference between a CPU and a GPU.

Because neural nets are created from large numbers of identical neurons, they're highly parallel by nature. This parallelism maps naturally to GPUs, which provide a data-parallel arithmetic architecture and a significant computation speed-up over CPU-only training. This type of architecture carries out a similar set of calculations on an array of image data. The single-instruction, multiple-data (SIMD) capability of the GPU makes it suitable for running computer vision tasks, which often involve similar calculations operating on an entire image. Specifically, NVIDIA GPUs significantly accelerate computer vision operations, freeing up CPUs for other jobs. Furthermore, multiple GPUs can be used on the same machine, creating an architecture capable of running multiple computer vision algorithms in parallel.
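The SIMD idea, the same operation applied to every pixel, can be sketched in plain Python by mapping one "kernel" function over independent rows of an image via a thread pool. This illustrates the data-parallel programming model only; in practice, real speedups come from GPU kernels, not Python threads:

```python
# GPUs apply the same instruction to many data elements at once (SIMD).
# This sketch mimics that data-parallel model on a CPU: an identical
# brightness threshold is applied to every pixel, with rows dispatched to
# a pool of workers. Illustrative only: Python's GIL means threads here
# show the model, not an actual speedup.

from concurrent.futures import ThreadPoolExecutor

def threshold_row(row, cutoff=128):
    """The 'kernel': the identical operation applied to each pixel."""
    return [255 if px >= cutoff else 0 for px in row]

def threshold_image(image, workers=4):
    # Each row is an independent chunk of data; map the same kernel over all.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(threshold_row, image))

image = [
    [0, 50, 100, 150],
    [200, 250, 30, 180],
]
binary = threshold_image(image)
```

On a GPU, each "row" (or pixel) would map to a hardware thread, and thousands of them would execute the same instruction stream simultaneously.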

NVIDIA GPU-Accelerated Deep Learning Frameworks

GPU-accelerated deep learning frameworks provide interfaces to commonly used programming languages such as Python. They also provide the flexibility to easily create and explore custom CNNs and DNNs, while delivering the high speed needed for both experiments and industrial deployment. The NVIDIA Deep Learning SDK accelerates widely used deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano, and Torch, as well as many other machine learning applications. These frameworks run faster on GPUs and scale across multiple GPUs within a single node. For GPU-accelerated training and inference of convolutional neural networks, NVIDIA provides cuDNN and TensorRT™, respectively. Both provide highly tuned implementations of standard routines such as convolution, pooling, normalization, and activation layers.

A step-by-step NVCaffe installation and usage guide is available, as is a fast C++/CUDA implementation of convolutional neural networks.

To develop and deploy vision models quickly, NVIDIA offers the DeepStream SDK for vision AI developers. It also includes the Transfer Learning Toolkit (TLT) for creating accurate and efficient AI models for the computer vision domain.

Figure: Training versus inference.

NVIDIA GPU-Accelerated, End-to-End Data Science

The NVIDIA RAPIDS suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like Pandas and Scikit-Learn APIs.

Figure: End-to-end data science and analytics pipelines entirely on GPUs.

Next Steps

For a more technical deep dive on CNNs, check out our developers site.
