Feature Pyramid Networks for Object Detection: FPN paper reading notes // 2022.5.9, Sunday afternoon, started reading at 16:18

Paper link

Feature Pyramid Networks for Object Detection | IEEE Conference Publication | IEEE Xplore

Paper purpose

The authors exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids at marginal extra cost. They develop a top-down architecture with lateral connections for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. The authors apply FPN to the Faster R-CNN architecture and achieve state-of-the-art detection results on the COCO dataset.

Personal summary

This paper is the pioneering work on the FPN feature pyramid network. First, the authors survey several approaches to multi-scale detection (featurized image pyramids, single-scale prediction, and the pyramidal feature hierarchy, which, unlike FPN, ignores the relationship between high-resolution features with weak semantics and low-resolution features with strong semantics), and then propose the FPN network. Next, they apply FPN to the RPN, Fast R-CNN, and Faster R-CNN networks and compare against the current baselines. In addition, they compare the FPN architecture with SOTA model results on the COCO dataset. In the extension experiments, the authors focus on using FPN to improve instance segmentation proposals; the experimental details are described at length in the appendix.

Contents of the paper

1. Introduction

First, the author introduces the concept of the image feature pyramid: in computer vision, recognizing objects at vastly different scales is a fundamental challenge. Feature pyramids built upon image pyramids (for short, we call these featurized image pyramids) form the basis of a standard solution [ 1 ] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels. The author notes that DPM achieved good results precisely because it used featurized image pyramids. In recognition tasks, however, deep learning, i.e., convolutional neural networks, has since replaced featurized image pyramids. ConvNets are robust to scale variation and capable of representing high-level semantics, so recognition from features computed on a single input scale becomes feasible. The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels. However, there are serious drawbacks: inference time increases considerably, and training such a network end-to-end requires a great deal of memory, so featurized image pyramids are practical only at test time. For these reasons, Fast/Faster R-CNN do not use featurized image pyramids. The feature hierarchy of a deep ConvNet produces feature maps at different spatial resolutions, but introduces large semantic gaps caused by the different depths; the low-level features of the high-resolution maps harm their representational capacity for object recognition. The author then introduces SSD (Single Shot Detector), the first attempt at using the ConvNet's pyramidal feature hierarchy, and points out that SSD forgoes reusing the already-computed layers and instead builds its pyramid starting from high up in the network (e.g., in the VGG net [ 34 ]) and then adds several new layers. It thus misses the opportunity to reuse the higher-resolution maps of the feature hierarchy, which the paper shows are important for detecting small objects. In recent studies, similar architectures with top-down and skip connections have become popular [ 27,16,8,25 ]. Their goal is to produce a single high-level feature map of fine resolution on which predictions are made (Fig. 2, top). In contrast, the FPN approach leverages this kind of architecture as a feature pyramid, with predictions (e.g., object detections) made independently at each level (Fig. 2, bottom). This model echoes a featurized image pyramid, which had not been explored in those works. Figure 2 illustrates the FPN structure: compared with the earlier top-down-plus-skip-connection architectures, FPN makes predictions at every level. Experimental results: in the ablation experiments, for bounding box proposals, FPN significantly improves the average recall (AR) by 8.0 points; for object detection, it improves the COCO-style average precision (AP) by 2.3 points and the PASCAL-style AP by 3.8 points over a strong single-scale baseline of Faster R-CNN on ResNet [ 15 ].
The method also extends readily to mask proposals: compared with the latest methods that rely heavily on image pyramids, it improves instance segmentation proposal AR and speed. Moreover, these improvements come without increasing test time over the single-scale baseline.

2. Related work

The author first discusses hand-engineered features and early neural networks. Deep ConvNet object detectors. With the development of modern deep ConvNets [ 18 ], object detectors such as OverFeat [ 32 ] and R-CNN [ 12 ] improved dramatically in accuracy. OverFeat adopted a strategy similar to early neural-network face detectors, applying a ConvNet as a sliding-window detector on an image pyramid. R-CNN adopted a region-proposal-based strategy [ 35 ], in which each proposal is scale-normalized before being classified with a ConvNet. SPPnet [ 14 ] demonstrated that such region-based detectors can be applied much more efficiently on feature maps extracted at a single image scale. Recent and more accurate detection methods such as Fast R-CNN [ 11 ] and Faster R-CNN [ 28 ] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects. Methods using multiple layers. Several recent methods improve detection and segmentation by combining predictions from different layers of a network. FCN [ 23 ] combines coarse-to-fine predictions from multiple layers by averaging segmentation probabilities. SSD [ 21 ] and MS-CNN [ 3 ] predict objects at multiple layers of the feature hierarchy. Another line of methods combines features of multiple layers before making predictions; these include Hypercolumns [ 13 ], HyperNet [ 17 ], ParseNet [ 22 ], and ION [ 2 ]. Recently, some methods use lateral connections to associate low-level feature maps across resolutions and semantic levels, including SharpMask [ 27 ] for segmentation, Recombinator networks [ 16 ] for face detection, and Stacked Hourglass networks [ 25 ] for keypoint estimation. Ghiasi et al. [ 8 ] propose a Laplacian pyramid representation for FCNs to progressively refine segmentation. Although these methods implicitly or explicitly adopt pyramid-shaped architectures, they are unlike featurized image pyramids [ 5,7,32 ], where predictions are made independently at all levels; see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [ 27 ].

3. Feature Pyramid Networks

Our goal is to leverage a ConvNet's pyramidal feature hierarchy (which has semantics from low to high levels) and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose; in this paper we focus on sliding-window proposers (Region Proposal Network, RPN) [ 28 ] and region-based detectors (Fast R-CNN) [ 11 ]. We also generalize FPN to instance segmentation proposals in Section 6. Our method takes a single-scale image of arbitrary size as input and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone architecture; the paper uses ResNet. The construction of the pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced below.

Bottom-up pathway. The bottom-up pathway is the feed-forward computation of the backbone network, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size; we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level per stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create the pyramid. This choice is natural, since the deepest layer of each stage should have the strongest features. Specifically, for ResNets [ 15 ], we use the feature activations output by each stage's last residual block. We denote the outputs of these last residual blocks of conv2, conv3, conv4, and conv5 as { C2, C3, C4, C5 }, and note that they have strides of { 4, 8, 16, 32 } pixels with respect to the input image. We do not include conv1 in the pyramid due to its large memory footprint.

Top-down pathway and lateral connections. The top-down pathway produces higher-resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized because it was subsampled fewer times. Figure 3 shows the building block that constructs our top-down feature maps. Given a coarser-resolution feature map, we upsample its spatial resolution by a factor of 2 (using nearest-neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest-resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest-resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. The final set of feature maps is called { P2, P3, P4, P5 }, corresponding to { C2, C3, C4, C5 } of the same spatial sizes. Because all levels of the pyramid use shared classifiers/regressors, as in a traditional featurized image pyramid, we fix the feature dimension (number of channels, denoted d) in all feature maps. We set d=256 in this paper, so all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers; we have empirically found their impact to be minor. Simplicity is central to our design, and we have found our model robust to many design choices. We experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [ 15 ] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
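To make the construction concrete, here is a minimal PyTorch sketch of the bottom-up/top-down merge just described. This is my own illustration, not the authors' Caffe2 code; the module layout, the default ResNet channel counts (256, 512, 1024, 2048) for { C2..C5 }, and the use of nearest-neighbor F.interpolate are assumptions consistent with the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Minimal FPN sketch: builds {P2..P5} from backbone maps {C2..C5}."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 lateral convs reduce every Ci to d channels
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # 3x3 convs on each merged map reduce upsampling aliasing
        self.smooth = nn.ModuleList(
            nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        merged = [laterals[-1]]  # start the iteration at C5 -> coarsest map
        for lat in reversed(laterals[:-1]):
            # upsample the coarser map by 2x (nearest neighbor) and add
            top = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + top)
        # final 3x3 conv on every merged map -> {P2, P3, P4, P5}
        return [s(m) for s, m in zip(self.smooth, merged)]

# e.g. with ResNet-50/101 feature map shapes for an 800x800 input:
fpn = FPN()
c2, c3, c4, c5 = (torch.randn(1, c, 800 // s, 800 // s)
                  for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32)))
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)
print([p.shape[-1] for p in (p2, p3, p4, p5)])  # [200, 100, 50, 25]
```

Note that, per the text, there are no non-linearities in these extra layers, which is why the sketch contains no ReLU.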

4. Applications

Our method is a generic solution for building feature pyramids inside deep networks. In the following, we adopt our method in RPN [ 28 ] for bounding box proposal generation and in Fast R-CNN [ 11 ] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [ 28,11 ] when adapting them to our feature pyramid.

4.1 Feature Pyramid Networks for RPN

RPN [ 28 ] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we call a network head. The object/non-object criterion and the bounding box regression targets are defined with respect to a set of reference boxes called anchors [ 28 ]. The anchors have multiple pre-defined scales and aspect ratios in order to cover objects of different shapes. We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level of the feature pyramid. Because the head slides densely over all locations at all pyramid levels, it is not necessary to have multi-scale anchors at any specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of { 32^2, 64^2, 128^2, 256^2, 512^2 } pixels on { P2, P3, P4, P5, P6 } respectively (P6 is introduced only to cover the larger 512^2 anchor scale and is simply a stride-2 subsampling of P5). As in [ 28 ], we also use anchors of multiple aspect ratios { 1:2, 1:1, 2:1 } at each level. So in total there are 15 anchors over the pyramid; a quick sanity-check sketch follows below. Training labels are assigned to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes, as in [ 28 ]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box, or an IoU over 0.7 with any ground-truth box, and a negative label if its IoU is lower than 0.3 for all ground-truth boxes. Note that the scales of the ground-truth boxes are not explicitly used to assign them to pyramid levels; instead, ground-truth boxes are associated with anchors, and the anchors have already been assigned to pyramid levels. As such, we introduce no extra rules beyond those in [ 28 ]. We note that the parameters of the heads are shared across all feature pyramid levels; we also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.
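As that sanity check, this short sketch (mine, not the paper's code) enumerates the anchor shapes implied by one area per level and three aspect ratios per position:

```python
# One anchor area per pyramid level; three aspect ratios at each position.
levels = ["P2", "P3", "P4", "P5", "P6"]
areas = [32**2, 64**2, 128**2, 256**2, 512**2]   # pixels^2
aspect_ratios = [0.5, 1.0, 2.0]                  # h/w of 1:2, 1:1, 2:1

anchors = []
for level, area in zip(levels, areas):
    for ar in aspect_ratios:
        w = (area / ar) ** 0.5   # solve w*h = area with h = ar*w
        h = ar * w
        anchors.append((level, round(w), round(h)))

print(len(anchors))   # 15 anchor types over the whole pyramid
print(anchors[:3])    # [('P2', 45, 23), ('P2', 32, 32), ('P2', 23, 45)]
```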

4.2 Feature Pyramid Networks for Fast R-CNN

Fast R-CNN [ 11 ] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels. We view our feature pyramid as if it were produced from an image pyramid, so we can adapt the assignment strategy of region-based detectors that run on image pyramids [ 14,11 ]. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by

k = ⌊k0 + log2(√(wh) / 224)⌋.    (1)

Here 224 is the canonical ImageNet pre-training size, and k0 is the target level onto which an RoI with w×h = 224^2 should be mapped. Analogous to the ResNet-based Faster R-CNN system [ 15 ], which uses C4 as the single-scale feature map, we set k0 to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k=3). We attach predictor heads (in Fast R-CNN, the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [ 15 ], ResNet's conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [ 15 ], we simply adopt RoI pooling to extract 7×7 features and attach two hidden 1024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared with the standard conv5 head, our 2-fc MLP head is lighter and faster.
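Eqn. (1) is easy to implement directly; here is a minimal sketch (the clamping to the available levels { P2..P5 } is my assumption, since the text only states the formula itself):

```python
import math

def roi_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Eqn.(1): k = floor(k0 + log2(sqrt(w*h) / canonical))."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))   # clamp to existing levels (assumed)

print(roi_level(224, 224))  # 4: the canonical ImageNet size maps to P4
print(roi_level(112, 112))  # 3: half of 224 maps one level finer
```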

5. Experiments on object detection

We perform experiments on the 80-category COCO detection dataset [ 20 ]. We train using the union of the 80k train images and a 35k subset of the val images (trainval35k [ 2 ]), and report ablations on a 5k subset of the val images (minival). We also report final results on the standard test set (test-std) [ 20 ], which has no disclosed labels. As is common practice [ 12 ], all network backbones are pre-trained on the ImageNet1k classification set [ 31 ] and then fine-tuned on the detection dataset. We use the pre-trained ResNet-50 and ResNet-101 models, which are publicly available. Our code is a reimplementation of py-faster-rcnn using Caffe2.

5.1 Region proposals with RPN

We evaluate the COCO-style average recall (AR) and the AR on small, medium, and large objects (ARs, ARm, and ARl), following the definitions in [ 20 ]. We report results for 100 and 1000 proposals per image (AR100 and AR1k). Implementation details. All architectures in Table 1 are trained end-to-end. The input image is resized such that its shorter side has 800 pixels. We adopt synchronized SGD training on 8 GPUs. A mini-batch involves 2 images per GPU and 256 anchors per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 30k mini-batches and 0.002 for the next 10k; a sketch of this schedule follows below. For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, unlike [ 28 ] where these boxes are ignored. Other implementation details are as in [ 28 ]. Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.
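That recipe boils down to the following schedule. This is a PyTorch-flavored sketch of the stated hyper-parameters only; the placeholder module and random inputs stand in for the real RPN-on-FPN model and data loader, which I do not attempt to reproduce.

```python
import torch

rpn = torch.nn.Conv2d(256, 256, 3, padding=1)   # placeholder for the model

# lr 0.02 for the first 30k mini-batches, 0.002 for the next 10k;
# momentum 0.9, weight decay 0.0001.
optimizer = torch.optim.SGD(rpn.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30_000], gamma=0.1)

for step in range(40_000):
    x = torch.randn(2, 256, 100, 100)  # stands in for 2 images per GPU
    loss = rpn(x).mean()               # stands in for the RPN losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                   # drops the lr to 0.002 at step 30k
```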

5.1.1 Ablation experiments

Comparisons with baselines. For fair comparisons with the original RPN [ 28 ], we run two baselines using the single-scale map of C4 (the same as [ 15 ]) or C5 (Table 1(a, b)), both using the same hyper-parameters as ours, including 5 scale anchors of { 32^2, 64^2, 128^2, 256^2, 512^2 }. Table 1(b) shows no advantage over (a), indicating that a single higher-level feature map is not enough, because there is a trade-off between coarser resolution and stronger semantics. Placing FPN in RPN improves AR1k to 56.3 (Table 1(c)), which is 8.0 points over the single-scale RPN baseline (Table 1(a)). In addition, performance on small objects (AR1k_s) is boosted by a large margin of 12.9 points. Our pyramid representation greatly improves RPN's robustness to object scale variation. How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. With this modification, the 1×1 lateral connections followed by the 3×3 convolutions are attached to the bottom-up pyramid. This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)). The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. We conjecture that this is because there are large semantic gaps between different levels of the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. We also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. This issue cannot be simply remedied by level-specific heads. How important are lateral connections? Table 1(e) shows the ablation of a top-down feature pyramid without the 1×1 lateral connections. This top-down pyramid has strong semantic features and fine resolutions. But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. More precise feature locations can be passed directly from the finer levels of the bottom-up maps to the top-down maps via the lateral connections. As a result, FPN has an AR1k score 10 points higher than Table 1(e). How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature map of P2 (i.e., the finest level in our pyramid). Similar to the single-scale baselines, we assign all anchors to the P2 feature map. This variant (Table 1(f)) is better than the baseline but inferior to our method. RPN is a sliding-window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance. In addition, we note that using P2 alone leads to more anchors (750k, Table 1(f)) caused by its large spatial resolution. This result suggests that a larger number of anchors is not, per se, sufficient to improve accuracy.

5.2 Object detection with Fast R-CNN

Next, we investigate FPN for region-based (non-sliding-window) detectors. We evaluate object detection by the COCO-style average precision (AP) and the PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes (namely APs, APm, and APl), following the definitions in [ 20 ]. Implementation details. The input image is resized such that its shorter side has 800 pixels. Synchronized SGD is used to train the model on 8 GPUs. Each mini-batch involves 2 images per GPU and 512 RoIs per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 60k mini-batches and 0.002 for the next 20k. We use 2000 RoIs per image for training and 1000 for testing. Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.

5.2.1 Fast R-CNN (on fixed proposals)

To better investigate FPN's effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on the small objects that are to be recognized by the detector. For simplicity, we do not share features between Fast R-CNN and RPN here, unless otherwise specified. As a ResNet-based Fast R-CNN baseline, following [ 15 ], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head. This gives an AP of 31.9 in Table 2(a). Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. It gets an AP of 28.8, indicating that the 2-fc head does not give any orthogonal advantage over the baseline in Table 2(a). Table 2(c) shows the results of our FPN in Fast R-CNN. Compared with the baseline in Table 2(a), our method improves AP by 2.0 points and small-object AP by 2.1 points. Compared with the baseline that also adopts a 2-fc head (Table 2(b)), our method improves AP by 5.1 points. These comparisons indicate that our feature pyramid is superior to single-scale features for a region-based object detector. Tables 2(d) and (e) show that removing the top-down connections or removing the lateral connections leads to inferior results, similar to what we observed in the RPN subsection above. It is noteworthy that removing the top-down connections (Table 2(d)) significantly degrades accuracy, suggesting that Fast R-CNN suffers from using the low-level features on the high-resolution maps. In Table 2(f), we adopt Fast R-CNN on the single finest-scale feature map of P2. Its result (33.4 AP) is marginally worse than that of using all pyramid levels (33.9 AP, Table 2(c)). We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region's scale. Despite the good accuracy of this variant, it is based on the RPN proposals of { Pk } and has thus already benefited from the pyramid representation.

5.2.2 Faster R-CNN (on consistent proposals)

In the above, we used a fixed set of proposals to investigate the detectors. But in a Faster R-CNN system [ 28 ], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible. Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN. Table 3(a) shows our reproduction of the baseline Faster R-CNN system as described in [ 15 ]. Under controlled settings, our FPN (Table 3(c)) is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5. Note that Tables 3(a) and (b) are baselines that are much stronger than the baseline provided by He et al. [ 15 ] in Table 3(*). We found the following implementation choices contribute to this gap: (i) we use an image scale of 800 pixels instead of the 600 in [ 11,15 ]; (ii) we train with 512 RoIs per image, which accelerates convergence, in contrast to the 64 RoIs in [ 11,15 ]; (iii) we use 5 scale anchors instead of the 4 in [ 15 ] (adding 32^2); (iv) at test time we use 1000 proposals per image instead of the 300 in [ 15 ]. Consequently, compared with He et al.'s ResNet-50 Faster R-CNN baseline in Table 3(*), our method improves AP by 7.6 points, and AP@0.5 also improves substantially. Sharing features. In the above, for simplicity, we did not share the features between RPN and Fast R-CNN. In Table 5, we evaluate sharing features following the 4-step training described in [ 28 ]. Similar to [ 28 ], we find that sharing features improves accuracy by a small margin. Feature sharing also reduces the testing time. Running time. With feature sharing, our FPN-based Faster R-CNN system has an inference time of 0.20 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.24 seconds for ResNet-101. As a comparison, the single-scale ResNet-50 baseline in Table 3(a) runs at 0.32 seconds. Our method introduces a small extra cost via the extra layers in the FPN, but has a lighter head. Overall, our system is faster than the ResNet-based Faster R-CNN counterpart. We believe the efficiency and simplicity of our method will benefit future research and applications.

5.2.3 Comparing with COCO state-of-the-art methods

We found that the ResNet-101 model in Table 5 is not sufficiently trained with the default learning rate schedule. So we increase the number of mini-batches run at each learning rate by 2× when training the Fast R-CNN step. This increases AP on minival to 35.6, without sharing features. This model is the one we submitted to the COCO detection leaderboard, shown in Table 4. We have not evaluated its feature-sharing version due to limited time, which should be slightly better, as implied by Table 5. Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors. On the test-dev set, our method outperforms the existing best results by 1.3 points of AP (36.2 vs. 34.9) and 3.4 points of AP@0.5 (59.1 vs. 55.7). It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, yet still has outstanding AP on small-scale objects. This could only be achieved by high-resolution image inputs with previous methods. Moreover, our method does not exploit many popular improvements, such as iterative regression [ 9 ], hard negative mining [ 33 ], context modeling [ 15 ], stronger data augmentation [ 21 ], etc. These improvements are complementary to FPN and should boost accuracy further.

6. Extensions: segmentation proposals

Our method is a generic pyramid representation and can be used in applications other than object detection. In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [ 26,27 ]. DeepMask/SharpMask are trained on image crops to predict instance segments and object/non-object scores. At inference time, these models are run convolutionally to generate dense proposals in an image. To generate segments at multiple scales, image pyramids are necessary [ 26,27 ]. It is easy to adapt FPN to generate mask proposals. We use a fully convolutional setup for both training and inference. We construct our feature pyramid as in Sec. 5.1 and set d=128. On top of each level of the feature pyramid, we apply a small 5×5 MLP to predict 14×14 masks and object scores in a fully convolutional fashion; see Figure 4. Additionally, motivated by the use of 2 scales per octave in the image pyramids of [ 26,27 ], we use a second MLP of input size 7×7 to handle half octaves. The two MLPs play a similar role as anchors in RPN. The architecture is trained end-to-end; full implementation details are given in the appendix.

6.1 Segmentation proposal results

Results are shown in Table 6. We report segment AR and segment AR on small, medium, and large objects, always for 1000 proposals. Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4. Switching to the slightly larger 7×7 MLP leaves accuracy largely unchanged. Using both MLPs together increases accuracy to 45.7 AR. Increasing the mask output size from 14×14 to 28×28 gains another point of AR (larger sizes begin to degrade accuracy). Finally, doubling the training iterations increases AR to 48.1. We also report comparisons to DeepMask [ 26 ], SharpMask [ 27 ], and InstanceFCN [ 4 ], the previous state-of-the-art methods in mask proposal generation. We outperform these approaches by over 8.3 points of AR; in particular, we nearly double the accuracy on small objects. Existing mask proposal methods [ 26,27,4 ] are based on densely sampled image pyramids (e.g., scaled by 2^{-2:0.5:1} in [ 26,27 ]), making them computationally expensive. Our approach, based on FPN, is substantially faster (our models run at 4 to 6 fps). These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.

7. Summary

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need to compute image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.

Appendix

A. Implementation details of the segmentation proposal experiments

We use our feature pyramid networks to efficiently generate object segment proposals, adopting the image-centric training strategy popular in object detection [ 11,28 ]. Our FPN mask generation model inherits many ideas and motivations from DeepMask/SharpMask [ 26,27 ]. However, unlike those models, which are trained on image crops and use densely sampled image pyramids at inference, we perform fully convolutional training on the feature pyramid to predict masks. While this requires changing many specifics, our implementation remains similar in spirit to DeepMask. Specifically, to define the label of a mask instance at each sliding window, we think of the window as a crop on the input image, which lets us inherit the definitions of positives/negatives from DeepMask. More details follow; see also the visualization in Figure 4. We construct the feature pyramid with P2 to P6, using the same architecture as described in Sec. 5.1, with d=128. Each level of our feature pyramid is used to predict masks at a different scale. As in DeepMask, we define the scale of a mask as the max of its width and height. Masks with scales of { 32, 64, 128, 256, 512 } pixels map to { P2, P3, P4, P5, P6 }, respectively, and are handled by the 5×5 MLP. Since DeepMask uses a pyramid with half octaves, the second, slightly larger MLP of size 7×7 (7 ≈ 5·√2) handles half octaves in our model (e.g., a mask of scale 128·√2 is predicted by the 7×7 MLP on P4). Objects at intermediate scales are mapped to the nearest scale in log space; a small sketch of this mapping follows below. Because the MLPs must predict objects within a range of scales at each pyramid level (particularly the half-octave range), some padding around the base object size must be provided; we use 25% padding. This means the mask outputs on { P2, P3, P4, P5, P6 } map to image regions of sizes { 40, 80, 160, 320, 640 } for the 5×5 MLP (and √2 larger for the 7×7 MLP). Each spatial position in the feature map is used to predict masks at different locations. Specifically, at scale Pk, each spatial position is used to predict masks whose centers fall within a 2^k-pixel range of that position (corresponding to a ±1 cell offset in the feature map). If no object center falls within this range, the position is considered negative and, as in DeepMask, is used only for training the score branch, not the mask branch.
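The scale-to-level bookkeeping above can be written compactly. In this sketch (my own illustration of the stated rule, not the authors' code), each level carries its canonical scale for the 5×5 MLP and a √2-larger half-octave scale for the 7×7 MLP, and a mask is assigned to whichever candidate is nearest in log space:

```python
import math

candidates = []   # (scale, level, which MLP)
for level, s in zip(["P2", "P3", "P4", "P5", "P6"], [32, 64, 128, 256, 512]):
    candidates.append((s, level, "5x5"))
    candidates.append((s * 2**0.5, level, "7x7"))  # half-octave scale

def assign_mask(scale):
    """Map a mask scale (max of width and height) to (level, mlp)."""
    _, level, mlp = min(
        candidates, key=lambda c: abs(math.log2(scale) - math.log2(c[0])))
    return level, mlp

print(assign_mask(128))  # ('P4', '5x5')
print(assign_mask(181))  # ('P4', '7x7'), since 128 * sqrt(2) ~ 181
print(assign_mask(100))  # ('P3', '7x7'), nearest to 64 * sqrt(2) ~ 90.5
```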

The MLP we use to predict masks and scores is simple. We apply a 5×5 kernel with 512 outputs, followed by two sibling fully connected layers to predict the 14×14 mask (14^2 outputs) and the object score (1 output). The model is implemented fully convolutionally (using 1×1 convolutions in place of the fully connected layers); a sketch follows below. The 7×7 MLP that handles half-octave objects is identical to the 5×5 MLP, except that its input region is larger. During training, we randomly sample 2048 examples per mini-batch (128 samples from each of 16 images) with a positive/negative sampling ratio of 1:3. The mask loss is weighted 10× higher than the score loss. The model is trained end-to-end with synchronized SGD (2 images per GPU) on 8 GPUs. We start with a learning rate of 0.03 and train for 80k mini-batches, dividing the learning rate by 10 after 60k mini-batches. The image scale is set to 800 pixels for both training and testing (we do not use scale jitter). During inference, our fully convolutional model predicts scores at all positions, together with the masks at the 1000 highest-scoring positions. We do not perform any non-maximum suppression or post-processing.
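Here is a minimal sketch of that head, implemented fully convolutionally as the text describes, with 1×1 convolutions standing in for the sibling fc layers. This is my illustration under the stated sizes, not the released model:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """kxk 'MLP' run fully convolutionally over one pyramid level."""
    def __init__(self, d=128, hidden=512, mask_size=14, kernel=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(d, hidden, kernel_size=kernel), nn.ReLU(inplace=True))
        # sibling outputs: a 14x14 mask (14^2 values) and 1 object score
        self.mask = nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, p):              # p: one pyramid level, (N, d, H, W)
        h = self.trunk(p)
        return self.mask(h), self.score(h)

# the 7x7 half-octave head is identical apart from its larger input region
head_5x5, head_7x7 = MaskHead(kernel=5), MaskHead(kernel=7)
mask, score = head_5x5(torch.randn(1, 128, 50, 50))
print(mask.shape, score.shape)  # (1, 196, 46, 46) and (1, 1, 46, 46)
```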

These notes are for my own future review only; no other use is intended.
