The R-CNN algorithm was proposed in 2014 and essentially laid the foundation for applying two-stage methods to object detection. Its structure is as follows.
The algorithm steps are as follows:
Compared with traditional object detection algorithms, R-CNN achieves roughly a 50% performance improvement. With VGG-16 as the object recognition model, it reaches 66% accuracy on the VOC2007 dataset, which is respectable. Its biggest problems are that it is very slow and consumes a lot of memory, for two main reasons.
To address these problems with R-CNN, Microsoft proposed the Fast R-CNN algorithm in 2015, which mainly optimized two things.
Both R-CNN and Fast R-CNN share a problem: the candidate boxes are generated by selective search, which is very slow. Moreover, the roughly 2,000 candidate boxes generated by R-CNN must each pass through the convolutional neural network, meaning about 2,000 forward passes through the CNN, which is very time-consuming (Fast R-CNN improved on this: the whole image passes through the CNN only once). This is the main reason for the slow detection speed of these two algorithms.
To solve this problem, Faster R-CNN introduced the RPN (Region Proposal Network) to obtain candidate boxes, doing away with the selective search algorithm; generating proposals then needs only one extra convolutional operation, which greatly improves recognition speed. The algorithm is quite complicated, and we will analyze it in detail. Its basic structure is as follows.
It is mainly divided into four steps:
The network structure adopts the VGG-16 convolutional model.
The convolutional layers adopt the VGG-16 model. First, the original P×Q image is scaled and cropped to an M×N image, which then passes through 13 conv-ReLU layers with four max-pooling layers interspersed among them. All convolution kernels are 3×3 with padding 1 and stride 1; the pooling kernels are 2×2 with padding 0 and stride 2.
After the convolutional layers, the M×N image becomes an (M/16)×(N/16) feature map.
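As a quick check on where the factor of 16 comes from, here is a minimal Python sketch (function name illustrative): the 3×3/padding-1/stride-1 convolutions preserve spatial size, while each of the four max-pooling layers halves it, giving a total reduction of 2^4 = 16.

```python
def vgg16_feature_map_size(m, n):
    """Trace the spatial size through VGG-16's conv stack (illustrative sketch)."""
    h, w = m, n
    # 13 conv layers: kernel 3x3, padding 1, stride 1 -> size unchanged
    for _ in range(13):
        h = (h + 2 * 1 - 3) // 1 + 1  # = h
        w = (w + 2 * 1 - 3) // 1 + 1  # = w
    # 4 max-pooling layers: kernel 2x2, stride 2 -> size halved each time
    for _ in range(4):
        h //= 2
        w //= 2
    return h, w

print(vgg16_feature_map_size(800, 608))  # -> (50, 38), i.e. (800/16, 608/16)
```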
Faster R-CNN abandons the selective search method used in R-CNN and uses the RPN layer to generate candidate boxes, which greatly speeds up proposal generation. The RPN layer first applies a 3×3 convolution and then splits into two branches. One branch judges whether each candidate box is foreground or background: it reshapes the features into a one-dimensional vector, applies softmax to classify foreground vs. background, and then reshapes back into a two-dimensional feature map. The other branch determines the position of the candidate box via bounding-box regression, which will be discussed in detail later. After both branches are computed, the foreground candidate boxes are kept (since objects are in the foreground), and the computed box positions are used to extract the feature sub-maps (proposals) we are interested in.
The convolutional layers extract the original image information into 256 feature maps. After the RPN's 3×3 convolution there are still 256 feature maps, but each point now fuses the spatial information of its 3×3 neighborhood. For each point on the feature map, k anchors are generated (k defaults to 9). Anchors are classified as foreground or background (whether the object is an airplane or a car does not matter here; we only distinguish foreground from background). Each anchor also has four coordinate values [x, y, w, h], where x and y are the center coordinates and w and h are the width and height. In this way, each point on the feature map yields k candidate regions of different sizes and shapes.
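A minimal sketch of how the k = 9 anchors at one feature-map point could be generated, assuming the common choice of 3 scales × 3 aspect ratios (the Faster R-CNN paper uses anchor areas of 128², 256², and 512² pixels with ratios 1:1, 1:2, and 2:1; names here are illustrative):

```python
import itertools

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors [x, y, w, h] centered at (cx, cy)."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        # Keep the anchor area ~ scale^2 while varying the aspect ratio w/h
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        anchors.append([cx, cy, w, h])
    return anchors

print(len(make_anchors(8, 8)))  # -> 9 anchors for this feature-map point
```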
For the generated anchors, we must first judge whether each is foreground or background. Since the objects of interest are in the foreground, after this step we can discard the background anchors. Most anchors belong to the background, so this step filters out many useless anchors and reduces the amount of computation in the fully connected layers.
The 256 feature maps obtained after the 3×3 convolution are transformed into 18 feature maps by a 1×1 convolution (18 = 2 foreground/background scores × 9 anchors per point). These are then reshaped into a one-dimensional vector, and softmax judges foreground vs. background; the only purpose of the reshape here is to arrange the data for the softmax computation. The foreground anchors identified this way are then output.
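A minimal NumPy sketch of this branch (the channel layout is illustrative):

```python
import numpy as np

def rpn_objectness(scores):
    """scores: (18, H, W) maps from the 1x1 conv, i.e. 2 (bg/fg) x 9 anchors.
    Returns the per-anchor foreground probability, shape (9, H, W)."""
    c, h, w = scores.shape                        # c = 18
    k = c // 2                                    # 9 anchors per position
    s = scores.reshape(2, k, h, w)                # split into (bg, fg) score pairs
    e = np.exp(s - s.max(axis=0, keepdims=True))  # numerically stable softmax
    probs = e / e.sum(axis=0, keepdims=True)
    return probs[1]                               # foreground probability

fg = rpn_objectness(np.random.randn(18, 50, 38))
print(fg.shape)  # -> (9, 50, 38)
```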
The other branch determines the position of the candidate box, that is, the [x, y, w, h] coordinates of the anchor. As shown below, red represents our currently selected region and green represents the ground-truth region. Although the red box roughly covers the airplane, it is still quite far from the true position and shape of the green box, so the generated anchors need to be adjusted. This process is called bounding-box regression.
Assume the coordinates of the red box are [x, y, w, h] and the coordinates of the green box, that is, the target box, are [Gx, Gy, Gw, Gh]. We need to establish a transformation so that [x, y, w, h] becomes [Gx, Gy, Gw, Gh]. The simplest idea is to translate so that the center points get close, and then scale so that w and h get close. As follows:
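The transformation (reconstructed here from the standard parameterization in the R-CNN family of papers, written in this article's notation) is:

$$G_x = w \cdot d_x + x, \qquad G_y = h \cdot d_y + y, \qquad G_w = w \cdot e^{d_w}, \qquad G_h = h \cdot e^{d_h}$$

Equivalently, the regression targets are $d_x = (G_x - x)/w$, $d_y = (G_y - y)/h$, $d_w = \ln(G_w / w)$, $d_h = \ln(G_h / h)$.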
What we learn are the four transformations dx, dy, dw, dh. Because the transformation is (approximately) linear, it can be modeled with linear regression; after setting a loss and an optimization method, deep learning can be used to train the model. For spatial position loss we generally use mean squared error rather than cross entropy (cross entropy is used for classification prediction). The optimizer can be the adaptive gradient descent algorithm Adam.
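A minimal sketch of applying learned deltas to an anchor, matching the parameterization above (values illustrative):

```python
import math

def apply_deltas(anchor, deltas):
    """Apply regression deltas [dx, dy, dw, dh] to an anchor [x, y, w, h]."""
    x, y, w, h = anchor
    dx, dy, dw, dh = deltas
    return [x + w * dx,        # shift the center horizontally, scaled by width
            y + h * dy,        # shift the center vertically, scaled by height
            w * math.exp(dw),  # scale the width (exp keeps it positive)
            h * math.exp(dh)]  # scale the height

print(apply_deltas([100, 100, 80, 40], [0.1, -0.05, 0.2, 0.0]))
```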
After obtaining the foreground anchors and determining their positions and shapes, we can output the foreground feature sub-maps (proposals). The steps are as follows:
1. Take the foreground anchors and their [x, y, w, h] coordinates.
2. Sort the anchors in descending order of foreground probability and keep the top pre_nms_topN, e.g. the first 6,000.
3. Exclude very small anchors.
4. Use NMS (non-maximum suppression) to find the boxes with the highest confidence; this mainly solves the problem of overlapping selections. First compute the area of each box, then sort by softmax score (i.e., the foreground probability) and put the highest-scoring box in the output queue. Next, compute the IOU between each remaining box and the current highest-scoring box (IOU is the intersection area of two boxes divided by their union area, a measure of how much two boxes overlap), and remove the boxes whose IOU exceeds a set threshold. This solves the overlapping-box problem (see the sketch after this list).
5. Select the top post_nms_topN results, e.g. 300, as the final proposals to output.
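A minimal NumPy sketch of the NMS step described in step 4 (boxes as [x1, y1, x2, y2] corners; names are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) foreground probabilities.
    Returns the indices of the kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # sort by score, descending
    keep = []
    while order.size > 0:
        i = order[0]                # highest-scoring remaining box
        keep.append(i)
        # Intersection of box i with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap box i more than the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```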
After this step, object localization is basically finished; what remains is object recognition.
Similar to Fast R-CNN, this layer mainly solves the problem that the proposals obtained so far have different sizes and shapes and therefore cannot be fed to a fully connected layer. Fully connected computation only operates on fixed shapes, so the proposals must be brought to the same size and shape. Cropping and scaling could solve this, but they cause information loss and image distortion. ROI pooling solves the problem effectively.
In ROI pooling, if the target output is M×N, the input proposal is divided into M×N parts horizontally and vertically, and the maximum value is taken in each part, yielding an M×N output feature map.
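A minimal NumPy sketch of this operation, assuming the pooled region is at least as large as the output grid (names illustrative):

```python
import numpy as np

def roi_pool(region, out_h, out_w):
    """Max-pool a variable-sized 2D feature region into a fixed (out_h, out_w) grid."""
    h, w = region.shape
    # Bin edges splitting the region into out_h x out_w roughly equal parts
    ys = np.linspace(0, h, out_h + 1, dtype=int)
    xs = np.linspace(0, w, out_w + 1, dtype=int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

print(roi_pool(np.random.randn(13, 9), 7, 7).shape)  # -> (7, 7)
```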
The feature maps after the ROI pooling layer pass through fully connected layers and a softmax to classify each proposal (e.g., person, dog, airplane), giving the probability vector cls_prob. At the same time, the proposal positions are fine-tuned by another round of bounding-box regression, giving bbox_pred, which is used to regress more accurate detection boxes.
This completes the whole pipeline of Faster R-CNN. The algorithm is quite complicated, and every detail needs repeated study. Using ResNet-101 as the convolutional backbone, Faster R-CNN reaches 83.8% accuracy on the VOC2012 dataset, surpassing YOLO, SSD, and YOLOv2. Its biggest problem is speed: it processes only about 5 frames per second, which cannot meet real-time requirements.
YOLO creatively proposed the one-stage method to overcome the common shortcoming of two-stage object detection algorithms: object classification and object localization are completed in a single step. YOLO directly regresses the bounding-box positions and categories at the output layer, achieving detection in one pass. This lets YOLO run at 45 frames per second, fully meeting real-time requirements (at about 24 frames per second the human eye perceives the video as continuous). Its network structure is as follows:
It is mainly divided into three parts: the convolutional layers, the object detection layer, and the NMS screening layer.
The convolutional layers use Google's Inception v1 network, corresponding to the first stage in the figure above, 20 layers in total. This stage mainly performs feature extraction, improving the model's generalization ability. However, the author modified Inception v1: instead of the full inception module, he used a 1×1 convolution followed by a 3×3 convolution as a replacement. (This can be seen as keeping only one branch of the inception module, which should simplify the network structure.)
After four more convolutional layers and two fully connected layers, the final 7×7×30 output is generated. The purpose of the four convolutional layers is to improve the model's generalization ability. YOLO divides a 448×448 input image into a 7×7 grid, and each grid cell predicts the coordinates (x, y, w, h) of two bounding boxes, the confidence that each box contains an object, and the probabilities that the object belongs to each of 20 categories (YOLO's training data is VOC2012, a dataset with 20 categories). So each grid cell corresponds to (4×2 + 2 + 20) = 30 parameters, as shown in the figure below.
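The confidence this refers to can be reconstructed from the YOLO paper (the original formula image is not reproduced here):

$$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$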
Here, the first factor indicates whether a human-labeled object falls into this grid cell: 1 if so, 0 otherwise. The second factor indicates how well the predicted bounding box coincides with the ground-truth box: it equals the intersection area of the two boxes divided by their union area. The larger the value, the closer the box is to the true position.
Classification information: YOLO's training set is VOC2012, a 20-class object detection dataset. Commonly used object detection datasets are as follows:
| Name | # Images (trainval) | # Classes | Last updated |
| - | - | - | - |
| ImageNet | 450k | 200 | 2015 |
| COCO | 120k | 90 | 2014 |
| Pascal VOC | 12k | 20 | 2012 |
| Oxford-IIIT Pet | 7k | 37 | 2012 |
| KITTI Vision | 7k | 3 | |
Each grid cell also predicts the probability that it belongs to each of these 20 categories. The classification information is per grid cell, not per bounding box, so only 20 values are needed rather than 40. The confidence, by contrast, is per bounding box: it only indicates whether the box contains an object, without predicting which of the 20 categories the object belongs to, so only two values are needed. Although classification information and confidence are both probabilities, their meanings are completely different.
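A minimal sketch of how each cell of the 7×7×30 output decomposes (the exact channel layout is illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (VOC)
pred = np.random.rand(S, S, B * 5 + C)   # a stand-in for the 7x7x30 output tensor

cell = pred[3, 4]                        # the predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # 2 boxes, each (x, y, w, h, confidence)
class_probs = cell[B * 5:]               # 20 class probabilities (per cell, not per box)
print(boxes.shape, class_probs.shape)    # -> (2, 5) (20,)
```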
The purpose of the screening layer is to pick the most suitable boxes out of the many predicted bounding boxes. The method is basically the same as in Faster R-CNN: first filter out the boxes whose score falls below a threshold, then apply NMS non-maximum suppression to the remaining boxes and remove those with high overlap (the NMS algorithm can be reviewed in the Faster R-CNN section above). This finally yields the most suitable boxes and their categories.
YOLO's loss function includes three parts: position error, confidence error, and classification error. The specific formula is as follows:
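Since the formula image is not reproduced here, the loss can be reconstructed from the YOLOv1 paper (with $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, matching the weights discussed below):

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \,\in\, \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$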
Mean squared error is used for all three errors. (In fact, I think the position error should use mean squared error, while the classification error should use cross entropy.) Because an object's position involves only 4 parameters while its category involves 20, their accumulated sums are on different scales, and giving them equal weight would clearly be unreasonable. So in YOLO the position error is weighted 5 and the category error is weighted 1. And because we do not particularly care about bounding boxes containing no objects, the confidence error of boxes without objects is weighted 0.5, while that of boxes containing objects is weighted 1.
Faster R-CNN has higher mAP accuracy and a lower missed-detection rate, but it is slow. YOLO is very fast, but its accuracy and missed-detection rate are unsatisfactory. SSD combines their strengths: for a 300×300 input image it reaches 58 frames per second on the VOC2007 test (on a Titan X GPU) with 72.1% mAP.
The network structure of SSD is as follows:
Like YOLO, it is divided into three parts: the convolutional layers, the object detection layer, and the NMS screening layer.
The SSD paper adopts the VGG-16 base network, which is in fact the usual approach of almost all object detection networks: first use a CNN to extract features, then perform the subsequent object localization and classification.
This part consists of five convolutional layers and an average pooling layer; the final fully connected layers are removed. SSD argues that in detection an object is only related to its surrounding information, so its receptive field is not global, and full connection is unnecessary and should not be used. SSD's characteristics are as follows.
Each convolutional layer outputs feature maps with different receptive fields. Training and prediction of object position and category are carried out on these feature maps of different scales, achieving multi-scale detection and overcoming YOLO's low accuracy on objects with unusual aspect ratios. In YOLO, only the last convolutional layer is used for training and prediction. This is a key reason SSD can improve accuracy over YOLO.
As shown in the figure above, object detection and classification are performed on each of these convolutional layers, and the results are finally filtered by NMS and output. Performing detection on multi-scale feature maps is equivalent to adding many bounding boxes of varied aspect ratios, which greatly improves generalization ability.
Similar to Faster R-CNN, SSD also adopts the concept of anchors. Each point on a convolutional output feature map corresponds to the center of a region of the original image. Centered on this point, six anchors (called default boxes in SSD) with different aspect ratios and sizes are constructed. Each anchor corresponds to 4 position parameters (x, y, w, h) and 21 class probabilities (the VOC training set is a 20-class problem, plus one class for whether the anchor is background, 21 classes in total). As shown in the figure below:
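A quick worked example of the prediction size per feature-map point implied by these numbers (a sketch; the actual SSD layers use 4 or 6 default boxes depending on the feature map):

```python
# Per feature-map point in SSD's VOC setting: illustrative arithmetic
num_default_boxes = 6     # default boxes of different scales and aspect ratios
num_loc_params = 4        # (x, y, w, h) offsets per box
num_classes = 20 + 1      # 20 VOC classes + 1 background class

outputs_per_point = num_default_boxes * (num_loc_params + num_classes)
print(outputs_per_point)  # -> 150 values predicted per feature-map point
```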
In addition, during training SSD fixes the ratio of positive to negative samples at 1:3. Given a training image and the ground-truth region of each object, the default box closest to each ground-truth box is first selected as a positive sample. Then, among the remaining default boxes, any whose IOU with a ground-truth box is greater than 0.5 is also taken as a positive sample, while the others serve as negative samples. Because most boxes are negative samples, positives and negatives would be badly imbalanced, so negatives are ranked by their class probability and only enough are kept to hold the positive-to-negative ratio at 1:3. SSD reports that this strategy improves accuracy by 4%.
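A minimal sketch of this hard-negative selection, assuming negatives are ranked by how confidently they are misclassified (all names illustrative):

```python
import numpy as np

def hard_negative_mining(neg_losses, num_pos, neg_pos_ratio=3):
    """Keep the hardest negatives so that negatives : positives <= 3 : 1."""
    num_neg = min(neg_pos_ratio * num_pos, len(neg_losses))
    # Highest-loss (hardest) negatives first
    return np.argsort(neg_losses)[::-1][:num_neg]

losses = np.random.rand(100)  # per-negative confidence losses (stand-in values)
print(hard_negative_mining(losses, num_pos=8).shape)  # -> (24,)
```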
In addition, SSD uses data augmentation: patches are generated with IOUs of 0.1, 0.3, 0.5, 0.7, and 0.9 with the target object's ground-truth box, and these patches are randomly sampled during training and randomly flipped horizontally. SSD reports that this strategy improves accuracy by 8.8%.
This is basically the same as YOLO's screening layer: the default boxes whose class probability is below a threshold are filtered out first, and then NMS non-maximum suppression filters out the boxes with high overlap. The difference is that SSD merges the default boxes output by detection on the different feature maps.
SSD can basically meet the needs of real-time object detection on mobile phones. TensorFlow's official object detection model ssd_mobilenet_v1_android_export.pb is implemented with the SSD algorithm. Its base convolutional network is MobileNet, which is suitable for deployment and execution on mobile devices.
To address YOLO's problems of low precision, frequent missed detections, and poor performance on objects with unusual aspect ratios, YOLOv2 was proposed, drawing on the characteristics of SSD. It mainly keeps YOLO's network structure and makes various optimizations and improvements on that basis, as follows.
The network adopts Darknet-19: 19 layers containing many 3×3 convolutions, with 1×1 convolution kernels and a global average pooling layer added, borrowing from Inception v1. Its structure is as follows.
YOLO and YOLOv2 can only recognize 20 kinds of objects. To overcome this limitation, YOLO9000 was proposed, which can recognize 9,000 kinds of objects. Built on YOLOv2, it is trained jointly on ImageNet and COCO, making full use of the fact that ImageNet can identify 1,000 object classes while COCO provides object positions. When training with ImageNet data, only the parameters related to object classification are updated; when training with COCO data, all parameters are updated.
YOLOv3 can be said to outperform practically all image detection algorithms of its time. Compared with DSSD (Deconvolutional SSD) and FPN (Feature Pyramid Network) from the same period, its accuracy is higher or similar, while it takes only about a third of their time.
The changes in YOLOv3 mainly include the following points:
However, if more precise bounding boxes are required and COCO AP is used as the evaluation standard, YOLOv3's accuracy is weaker, as shown in the figure below.
New object detection algorithms keep emerging. In the two-stage field, Facebook proposed Mask R-CNN in 2017, and CMU proposed the A-Fast-RCNN algorithm, which introduced adversarial learning into object detection. Face++ proposed Light-Head R-CNN, mainly exploring how to balance accuracy and speed in R-CNN-style object detection.
The one-stage field is also flourishing. In 2017, Seoul National University proposed the R-SSD algorithm, mainly addressing the poor detection of small objects. The RON algorithm proposed by Tsinghua University combines the advantages of the two-stage and one-stage methods, paying more attention to multi-scale object localization and negative sample mining.
Deep learning algorithms for object detection must handle both object localization and object recognition, so they are relatively complex. New algorithms keep emerging, yet there is strong continuity between models: most build on the ideas of their predecessors, standing on the shoulders of giants. We need to understand the characteristics of the classic models, which problems their tricks solve, and why. Only then can we reason by analogy, because however much the models change, the underlying principles stay the same. To sum up, the main difficulties in object detection are as follows: