VisionISP: An Image Processing Pipeline for Computer Vision Applications


When we think of cameras, we usually think of them being used by humans to take pictures. Although humans are generally considered the primary consumers of images captured by cameras, machines have also started consuming images on a very large scale. Today, virtually all computer vision applications use cameras optimized for human viewers. The image signal processors (ISPs) in those cameras are usually tuned for photography-driven image quality characteristics that matter to the human visual system. However, tuning an imaging pipeline for optimal perceptual image quality does not guarantee optimal results for computer vision tasks. For instance, heavier noise reduction might yield better perceptual image quality, whereas a machine vision system might benefit from tolerating a higher level of noise in exchange for more information. An image processed for optimal computer vision performance can look vastly different from an image processed for optimal perceptual image quality.

We propose a set of processing blocks, which we collectively call the VisionISP, to build an image processing pipeline optimized for machine consumption. The blocks in VisionISP are simple, content-aware, and trainable.
The first block is a computer-vision-driven denoiser that tunes an existing ISP without modifying the underlying algorithms or hardware design. Our ISP tuning algorithm adjusts the denoising parameters to minimize a high-level content loss in the denoised images. We calculate this loss on feature maps extracted from a target neural network that is pre-trained to perform a particular computer vision task. In this way, the denoiser learns to preserve what is important in the image for the target machine vision task.
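To make this concrete, here is a minimal sketch of the tuning idea, not the authors' exact method. The `backbone` (a frozen torchvision ResNet-18 stage standing in for the target network), the Gaussian-blur `denoise` stand-in for the black-box ISP denoiser, and the availability of a clean reference frame during tuning are all assumptions for illustration.

```python
# A minimal sketch of CV-driven denoiser tuning, not the authors' exact
# method. Assumptions: a frozen pre-trained network plays the role of the
# target task network, the ISP denoiser is a black box (Gaussian blur with
# a tunable strength stands in for it), and a clean reference is available.
import torch
import torchvision
from torchvision.transforms import functional as TF

# Frozen feature extractor from a network pre-trained on the target task.
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
)
for p in backbone.parameters():
    p.requires_grad_(False)

def content_loss(img, ref):
    """High-level content loss: distance between backbone feature maps."""
    return torch.nn.functional.mse_loss(backbone(img), backbone(ref))

def denoise(img, strength):
    """Black-box stand-in for the ISP denoiser (Gaussian blur here)."""
    return TF.gaussian_blur(img, kernel_size=7, sigma=float(strength))

# Because a hardware ISP is not differentiable, tuning can be done with
# black-box search over the denoising parameters; a grid search suffices
# here for a single strength parameter.
clean = torch.rand(1, 3, 224, 224)             # clean reference frame
noisy = clean + 0.1 * torch.randn_like(clean)  # simulated sensor noise
best = min(
    (s / 10 for s in range(1, 31)),
    key=lambda s: content_loss(denoise(noisy, s), clean).item(),
)
print(f"selected denoising strength: {best:.1f}")
```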
The second block in our pipeline is a trainable local tone mapping operator that reduces the bit depth of its input. Reducing the bit depth translates into simpler hardware, reduced bandwidth, and significant savings in power. Unlike uniform bit-depth reduction, our method ensures that the features essential for computer vision applications are preserved after bit-depth reduction. We do this by applying a global non-linear transformation followed by a local detail boosting operator before bit-depth reduction. The non-linear transform acts as a trainable global tone mapping operator, while the detail boosting operator acts locally to preserve the details in the low-bit-per-pixel output.
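As a rough sketch of this structure (the exact operators in the paper may differ), the hypothetical `TrainableToneMapper` below combines a learnable gamma-style global curve, an unsharp-mask-style local detail boost, and quantization to the target bit depth:

```python
# A rough sketch of the bit-depth-reducing tone mapper, assuming a learnable
# gamma-style global curve and an unsharp-mask-style local detail boost; the
# operators in the paper may differ. Input is assumed normalized to [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableToneMapper(nn.Module):
    def __init__(self, out_bits=4):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # global curve shape
        self.boost = nn.Parameter(torch.ones(1))       # local detail gain
        self.levels = 2 ** out_bits - 1

    def forward(self, x):
        # Trainable global tone mapping: x ** gamma with gamma > 0.
        x = x.clamp(1e-6, 1.0) ** self.log_gamma.exp()
        # Local detail boost: amplify the difference from a local average.
        local_mean = F.avg_pool2d(x, 5, stride=1, padding=2)
        x = (local_mean + self.boost * (x - local_mean)).clamp(0.0, 1.0)
        # Quantize to the target bit depth; the straight-through estimator
        # lets gradients flow through the non-differentiable rounding.
        q = torch.round(x * self.levels) / self.levels
        return x + (q - x).detach()

tm = TrainableToneMapper(out_bits=4)
y = tm(torch.rand(1, 1, 64, 64))  # e.g. one high-bit-depth raw channel
print(y.unique().numel(), "distinct levels")  # at most 2**4 = 16
```

The straight-through rounding step is one common way to make quantization trainable end-to-end, so the global curve and the detail gain can be learned against a downstream task loss.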
The last block in our pipeline is a very simple convolutional neural network that acts as a preprocessor for a subsequent computer vision task. The first layer in this block is a 1×1 convolution that can be thought of as a trainable color space converter, one that finds an optimal color space automatically. The following layer is a 7×7 convolution that extracts low-level features, such as edges and textures, while reducing the input resolution. This layer has a flexible stride parameter that allows the downscaling rate to be adjusted without retraining. Many computer vision systems use downscaled images to be able to run in real time. However, conventional downscaling methods, such as bilinear interpolation, are content-agnostic, so small details in a scene, such as pedestrians, can easily be discarded during downscaling. This feature extraction layer processes full-resolution data and helps downscale images without dropping the features that will be needed in the next stages.
The final layer in the block projects the output feature maps into three channels, since computer vision systems typically expect 3-channel images as inputs. Although those 3-channel inputs do not look natural to human viewers when visualized as pseudo-color images, they “look good” to machines in the sense that they provide a very efficient representation of what a camera should feed into a computer vision system to perform well.
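Here is a minimal sketch of such a preprocessor. The 1×1 and 7×7 kernel sizes and the 3-channel output follow the description above, while the channel widths and the hypothetical `VisionScaler` name are assumptions:

```python
# A minimal sketch of the CNN preprocessor described above; the 1x1 and 7x7
# kernel sizes and the 3-channel output follow the text, while the channel
# widths are assumptions. The stride of the 7x7 layer is exposed so the
# downscaling rate can be changed at inference time without retraining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionScaler(nn.Module):
    def __init__(self, in_ch=3, mid_ch=16):
        super().__init__()
        self.color = nn.Conv2d(in_ch, mid_ch, kernel_size=1)  # trainable color space converter
        self.feat = nn.Conv2d(mid_ch, mid_ch, kernel_size=7, padding=3)  # low-level feature extractor
        self.proj = nn.Conv2d(mid_ch, 3, kernel_size=1)       # project to 3 channels

    def forward(self, x, stride=2):
        x = self.color(x)
        # Apply the 7x7 layer with an adjustable stride: the same trained
        # weights downscale by 2x, 4x, etc., depending on this parameter.
        x = F.relu(F.conv2d(x, self.feat.weight, self.feat.bias,
                            stride=stride, padding=3))
        return self.proj(x)

scaler = VisionScaler()
full_res = torch.rand(1, 3, 512, 512)
print(scaler(full_res, stride=2).shape)  # torch.Size([1, 3, 256, 256])
print(scaler(full_res, stride=4).shape)  # torch.Size([1, 3, 128, 128])
```

Because the stride is a call-time argument rather than a fixed property of the layer, the same trained weights can serve several downscaling rates, matching the flexible-stride behavior described above.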
For example, this demo video shows how object detection results on frames processed by the VisionISP compare with those from a basic image signal processor that is not optimized for computer vision. The VisionISP significantly reduces the data transmission needs between an image signal processor and a computer vision engine while preserving the information that is relevant to a computer vision system. Take a look at our paper to see our experimental results and learn more about how VisionISP works.

One comment

  • This looks interesting. I have a question: could you explain the Trainable Vision Scaler? What is the ground truth for training? In the paper, your TVS outperforms bilinear scaling, but TVS sounds like it is basically a CNN, which implies there must be a ground truth. How is that ground truth created, and does the approach work well on other computer vision tasks?
