We have heard about Face Recognition, Object Detection, Emotion Recognition, etc. However, we do not know how people do it, or to be more precise, how a computer is able to accomplish such tasks. We are amazed and want to build similar systems. In this blog, the concept of Convolutional Neural Networks (CNNs) as well as their application in Computer Vision (CV) will be discussed. We’ll start by introducing the problems in CV – image classification, localization, etc. – and how images are stored in a computer. Thereafter, we’ll describe the architecture of a CNN in detail and see why the convolutional layer is so well-suited for CV applications.
I – Computer Vision
There are many problems in CV to take into consideration. For example, we want to know whether there is a dog or a cat or any other kind of animal in the picture. Additionally, we want to know the location of the objects in the picture, or maybe we even want the computer to automatically describe, or give a caption to, the picture.
Image Classification is the task of taking an input image and outputting a class (a cat, a dog, a horse, etc.) or a probability over classes that best describes the image. As humans, we have been learning to classify different objects since we were born. We can effortlessly distinguish between the background (the environment) and the object. The skills of recognizing and identifying patterns, generalizing from prior knowledge, and adapting to different types of images are ones that we do not share with our computers. Another example of this type of problem is recognizing hand-written digits or letters. In this task, the machine should be able to tell which number was written by a human and perform related actions – process it, print it, say it out loud, etc. This has many real-life applications, like processing the address written on a postcard, detecting the numbers on car plates, or making a doctor’s prescription more readable.
Unlike humans, computers cannot easily identify and generalize patterns. There are many challenges for our machines to deal with in this type of problem: a single instance of an object can be viewed from different angles (Viewpoint variation), objects come in different sizes (Scale variation), they can be deformed in extreme ways (Deformation), only a small part of the object may be visible (Occlusion), light affects the picture (Illumination conditions), the object can blend into the environment (Background Clutter), and there may be many variations of the same object (Intra-class variation). Therefore, a good image classifier should be able to handle all of these problems.
Later on, after successfully classifying the image, we want to localize our object. The task is not only to produce a class label, as in image recognition, but also a bounding box that tells where the object is in the picture. Extending the problem, we may want our machine to detect multiple objects in the image. In that case, there will be not just one but many bounding boxes with different labels in the image. Finally, we want the computer to give a description of what it sees in the image, e.g. a man is riding a horse next to a dog.
Nevertheless, computers do not comprehend images the same way as we do. We see colors, shapes, saturation, etc. On the other hand, computers only see numbers.
Images are stored in a computer in the form of arrays of numbers, where each number represents a pixel in the image. The image can be represented in grayscale or in RGB. An RGB image has 3 channels, and thus results in 3 two-dimensional arrays of pixels (a grayscale image has only one). In either case, we can see that meeting all the above challenges with nothing but arrays of numbers is a tall order.
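As a minimal NumPy sketch of this storage format (the pixel values here are synthetic, just to show the shapes):

```python
import numpy as np

# Synthetic 28x28 grayscale image: one 2D array, pixel values 0-255.
gray = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Synthetic 28x28 RGB image: three channels stacked along the last axis.
rgb = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)

print(gray.shape)  # (28, 28)    -> a single 2D array of pixels
print(rgb.shape)   # (28, 28, 3) -> three 2D arrays, one per channel
print(gray[0, 0])  # a single pixel is just a number
```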
However, recent advances in Deep Learning have made these tasks possible. One of the most popular variants of Neural Networks for dealing with images and videos is the Convolutional Neural Network.
II – Convolutional Neural Networks
In the previous blog, we had a glimpse of different types of Neural Networks. Convolutional Neural Networks are very similar to ordinary Neural Networks. However, they explicitly assume that the input is an image. What is so special about Convolutional Neural Networks that makes them able to deal with image problems?
In a normal Neural Network, each neuron is fully connected to the neurons in the previous and next layers. This means that every single connection in the network has its own weight. Convolutional Neural Networks, on the other hand, are not “fully-connected” in this sense. Instead, groups of neurons share the same weights when connecting to the next layer. We can imagine this behavior as a small piece of paper (the filter) applied on top of a bigger one (the input).
The idea behind this is that, unlike traditional inputs – which carry unrelated values, like the size, number of rooms, and number of floors of a house – images have internal structure. For instance, in a picture of a face, the eyes are groups of pixels; or, when detecting the background, the sky is a huge area of the picture. Nearby pixels are related and thus should be treated in the same manner. This helps recognize patterns that ordinary neural networks cannot.
Another big difference is that a CNN takes its input as a 3D volume of neurons rather than a flat vector. This architecture is more flexible when dealing with images.
Now, we will explore the CNN a little more deeply. Let’s start with the general architecture. A typical architecture is constructed as follows: an input layer followed by one or several Convolutional Layer(s), an activation, and then a Max/Average Pooling Layer; this combination is repeated several times. Afterwards, the result is flattened into a column vector and fully connected to an ordinary layer (like the ones we know from the previous blog), which we call a Fully-Connected Layer. Finally, there is an output layer whose number of neurons equals the number of classes we want to predict, e.g. 10 neurons for 10 digits.
In a normal Neural Network, each column of neurons is called a layer. In a CNN, however, several new terms are introduced. First, let’s start with the Convolutional Layer.
As mentioned above, instead of having a different weight for each connection, in a CNN a group of neurons shares the same weights. This is done through a special type of layer – the Convolutional Layer (Conv-Layer). A Conv-Layer is represented as a matrix (a filter) with the same dimensionality as the input, i.e. if the input is 2D, the filter is also 2D, but its size is smaller (often 3×3, 5×5, or 7×7).
The purpose of this layer is to pick out features contained in the image, for instance vertical/horizontal edges, gradients, etc. In order to examine multiple features, there are multiple different filters. Together, they form an output of neurons that are connected to local regions of the input. In other words, the output of this layer consists of the features extracted from local regions of the image. To get the output, we perform a dot product between the filter and each region of the input.
The filter moves step by step from left to right and top to bottom over the input. At each step, it moves by the specified number of strides.
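The sliding dot product can be sketched in plain NumPy. This is a minimal illustration (no padding, a single hand-made vertical-edge filter), not an optimized implementation:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel left-to-right, top-to-bottom over the image,
    taking a dot product with each patch (no padding)."""
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(patch * kernel)
    return result

# A tiny image with a vertical edge, and a vertical-edge filter.
image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=float)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

# 4x4 input, 3x3 filter, stride 1 -> 2x2 output with strong responses
# wherever the filter straddles the edge.
print(convolve2d(image, edge_filter))
```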
Zero-padding is also used: it pads the input volume with zeros around the border. The nice feature of zero-padding is that it lets us control the size of the output. If the output size is the same as the input, we call it SAME padding. If not, we call it VALID padding (VALID usually means that no padding was used). The size of the output layer is then calculated according to this formula:

Output size = (N − F + 2P) / S + 1
where N is the size of the input layer, F is the size of the Conv-Layer, P is the number of zero-padding used, and S is the number of strides.
For example, we specified:
- Size of the input is 28×28; (NxN)
- Size of the Conv-Layer is 3×3; (FxF)
- Stride equal to 1; (S)
- Zero-padding equal to 1; (P)
The output size will be (28 − 3 + 2×1)/1 + 1 = 28, i.e. 28×28 – the same as the input (SAME padding).
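The output-size formula (N − F + 2P)/S + 1 is easy to check in code; here is a small helper with the worked example above:

```python
def conv_output_size(n, f, p, s):
    """Output size of a convolution: (N - F + 2P) / S + 1,
    where N = input size, F = filter size, P = padding, S = stride."""
    return (n - f + 2 * p) // s + 1

# The worked example: 28x28 input, 3x3 filter, padding 1, stride 1.
print(conv_output_size(28, 3, 1, 1))  # 28 -> same as the input (SAME padding)

# Without padding the output shrinks (VALID padding).
print(conv_output_size(28, 3, 0, 1))  # 26
```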
After we perform the convolution, we activate the output element-wise according to an activation function. ReLU is widely used at this step. This step can be combined with the Conv-Layer to form a single step: Convolution + ReLU.
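ReLU itself is just max(0, x) applied to every element of the feature map; a one-line NumPy sketch (the feature-map values are made up for illustration):

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negatives become 0, positives pass through."""
    return np.maximum(0, x)

feature_map = np.array([[-27.0, 5.0],
                        [3.0, -1.0]])
print(relu(feature_map))  # negatives clipped to 0, shape unchanged
```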
It is common to periodically insert a Pooling Layer between successive Conv-Layers in a CNN architecture. Its function is to reduce the size of the representation (the matrix), thereby reducing the number of parameters and the amount of computation in the network. It keeps the most important feature in each region (Max-pooling) or averages the region out (Average-pooling). The most common form is Max-pooling with a filter size of 2×2 and a stride of 2, which halves the width and height of the input.
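The common 2×2, stride-2 Max-pooling can be sketched in NumPy as follows (the input values are arbitrary, just to show that each 2×2 region collapses to its maximum):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Keep only the largest value in each size x size region."""
    n = x.shape[0]
    out = (n - size) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            region = x[i*stride:i*stride+size, j*stride:j*stride+size]
            result[i, j] = region.max()
    return result

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [1, 2, 9, 8],
              [0, 1, 7, 3]], dtype=float)
print(max_pool(x))  # 4x4 input -> 2x2 output: [[6. 5.] [2. 9.]]
```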
The Conv-Layer and Pooling Layer are repeated several times to extract deeper features from the image. Then the result is flattened into a column vector, which is fully connected to the next layer.
Neurons in a Fully-Connected Layer (FC-Layer) have full connections to all activations in the previous layer, as in a regular Neural Network. There are usually a few FC-Layers at the end of the network. They feed into the classifier, which is a Softmax Layer (Softmax is an activation function often used in multi-class classification).
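Softmax turns the raw scores of the final layer into class probabilities; a small NumPy sketch (the scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1.
    Subtracting the max keeps the exponentials numerically stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g. raw outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())  # largest score -> highest probability; total is 1.0
```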
We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise), and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies an element-wise non-linearity.
These layers are often stacked together following the pattern:
INPUT → [[CONV+RELU] * N → POOL] * M → [FC + RELU] * K → FC (OUTPUT)
where * indicates repetition, POOL can be Max or Average-pooling, N is between 0 and 3, M is greater or equal to 0, K is between 0 and 2. For example, here are some common CNN architectures that we may see:
- INPUT → FC – implements a linear classifier. Here M = N = K = 0.
- INPUT → CONV → RELU → FC
- INPUT → [CONV → RELU → POOL] * 2 → FC + RELU → FC. Here a Pooling Layer follows every Conv-Layer.
- INPUT → [CONV → RELU → CONV → RELU → POOL] * 3 → [FC + RELU] * 2 → FC. Here a Pooling Layer follows every two Conv-Layers. This is a good idea for larger and deeper networks, because stacked Conv-Layers can develop more complex features of the input volume.
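The pattern above can be expanded mechanically; here is a small sketch that turns the (N, M, K) repetition counts into a flat list of layer names:

```python
def cnn_pattern(n, m, k):
    """Expand INPUT -> [[CONV -> RELU] * N -> POOL] * M
       -> [FC -> RELU] * K -> FC into a list of layer names."""
    layers = ["INPUT"]
    for _ in range(m):
        for _ in range(n):
            layers += ["CONV", "RELU"]
        layers.append("POOL")
    for _ in range(k):
        layers += ["FC", "RELU"]
    layers.append("FC")
    return layers

# M = N = K = 0: just a linear classifier.
print(" -> ".join(cnn_pattern(0, 0, 0)))  # INPUT -> FC

# A POOL after every two CONV+RELU pairs, repeated 3 times, then 2 FC+RELU.
print(" -> ".join(cnn_pattern(2, 3, 2)))
```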
The idea of a larger and deeper network is that the early layers extract small features from the picture; going deeper, these features gradually combine into bigger, more meaningful ones.
Case Studies: AlexNet, GoogLeNet, VGGNet, ResNet
There are many other architectures for CNNs. Indeed, the best way to learn about this type of neural network is to take the one best suited for our application, modify it a bit, and retrain it on our own data. This is what we call transfer learning. We will discuss it in more detail later.
That was a brief introduction to Computer Vision and Convolutional Neural Networks. In the next blog, we will dig deeper into how we can build Face-Recognition and Object-Detection systems using state-of-the-art CNN architectures. Stay tuned for more!