Learning Memory Propagation And Matching For Semi-Supervised Video Object Segmentation

This paper studies the task of semi-supervised video object segmentation (VOS). Multiple works have shown the outstanding performance of matching-based memory retrieval methods, which perform temporal and spatial pixel-level matching but do not attend to the temporal relationship between frames. To this end, we propose a memory propagation and matching (MPM) method that combines the propagation-based and matching-based approaches, reducing wrong matches, maintaining consistency between adjacent frames, and making the model more robust to occlusion and to object disappearance and reappearance. Inspired by the remarkable effect of recurrent neural network (RNN) based methods on video tasks, we propose a Memory Propagation (MP) module that uses a Convolutional Gated Recurrent Unit (ConvGRU) for memory propagation, refining the memory as each target frame is segmented. At the same time, MPM matches the target frame with the first frame and the previous adjacent frame. The Multi-object Matching (MOM) module calculates the probability of each pixel belonging to each object, so that the MPM model can effectively distinguish different objects. Experiments show that the MPM model achieves a J&F of 82.8% on the DAVIS 2017 validation set and 80.1% on the YouTube-VOS dataset.


Introduction
Video object segmentation is a fundamental task in computer vision, widely used in video editing, video synthesis, and autonomous driving. This paper focuses on semi-supervised video object segmentation, in which subsequent video frames are segmented according to the given mask of the first frame of the sequence. Video object segmentation faces many challenges, such as occlusion, object disappearance and reappearance, and large-scale and morphological changes of objects. Moreover, a video sequence may contain one or more objects, and when multiple objects appear in a frame, their segmentation results are easily confused.
In recent years, semi-supervised video object segmentation has been addressed by propagation-based methods (Lin et al, 2019; Oh et al, 2018) and detection-based methods (Chen et al, 2018b; Seong et al, 2020; Voigtlaender et al, 2019; Caelles et al, 2017; Oh et al, 2019; Voigtlaender and Leibe, 2017; Hu et al, 2018b; Yang et al, 2020). Propagation-based methods propagate the segmentation mask of reference frames to subsequent frames. However, as shown in Fig. 1a, the left example contains an occluded object and the right example contains an object that disappears and reappears in the sequence; the propagation-based method does not perform well in either case. Among detection-based methods, some (Voigtlaender and Leibe, 2017), instead of using temporal information, learn an appearance model to detect and segment the object at the pixel level in each frame. During inference, the trained model is fine-tuned on the first-frame mask, which is effective but very time-consuming. Matching-based methods (Son et al, 2015; Chen et al, 2018b; Hu et al, 2018b) generate a template of the object of interest from the first-frame annotation, then match the template with the target frame and guide the segmentation by computing a similarity matrix. Due to the lack of inter-frame consistency, false matches in the background and object confusion often occur, as shown in Fig. 1b. In view of the above problems, we propose a method that integrates propagation and matching to complete the task of semi-supervised video object segmentation.
In the field of unsupervised video object segmentation, LVO (Tokmakov et al, 2017) uses a bidirectional ConvGRU (Ballas et al, 2015) to memorize appearance features and optical-flow features at the same time. PDB-ConvLSTM (Song et al, 2018) uses bidirectional Convolutional Long Short-Term Memory (ConvLSTM) to learn spatio-temporal features. These two methods achieve remarkable results in unsupervised VOS, but in semi-supervised VOS, RNN-based methods have not been used to their full advantage. The existing STM-series methods (Hu et al, 2021; Oh et al, 2019; Wang et al, 2021; Xie et al, 2021), which achieve the best results in semi-supervised VOS, use a spatio-temporal attention mechanism to perform pixel-by-pixel global feature matching between the target frame and historical frames. Compared with this approach, RNN-based memory can memorize features frame by frame and thus track the spatio-temporal consistency of moving objects to a certain extent. Inspired by the above methods, this paper proposes a Memory Propagation (MP) module: a ConvGRU remembers the features of the historical frames, and the memorized features are propagated to the target frame according to the similarity between the previous adjacent frame and the target frame. Experiments show that the proposed MP module fully propagates temporal cues and guides the segmentation of the target frame.
Considering two severe challenges in the VOS task, occlusion and object disappearance and reappearance, we propose a Multi-object Matching (MOM) module. It first matches the target frame with the first frame and the previous adjacent frame, and then uses the first-frame features, together with the accurate object information in the first-frame mask, to guide intra-frame matching of the target frame. Intra-frame matching takes into account the probability that each pixel belongs to multiple objects, mitigating object confusion as much as possible.
In addition, we propose a High Frequency Refine (HFR) module that attends to the edge information of the object, making the segmented object edges more complete. The contributions of this paper are summarized as follows: • We propose a Memory Propagation module that uses temporal continuity information to guide the segmentation of the target frame. • We propose a Multi-object Matching module that combines inter-frame and intra-frame matching to search for target pixels, addressing occlusion and object reappearance. • We propose a High Frequency Refine module that makes the model pay more attention to edge information, producing higher-quality segmentation results. • Experiments show that the proposed method achieves very competitive results on the DAVIS 2017 and YouTube-VOS datasets.
Related Works

Region proposal methods
Inspired by the tasks of image object detection and video object tracking, some methods (Luiten et al, 2019; He et al, 2020; Li and Loy, 2018; Huang et al, 2020) in VOS segment by proposing candidate regions. Some of them (He et al, 2020; Huang et al, 2020) use two-stage training, and some (Li and Loy, 2018) are trained end to end. DTTM-TAN (Huang et al, 2020) uses 3D convolution to extract features from consecutive frames and performs spatio-temporal aggregation with the features of the target frame; multiple proposals are then generated on the aggregated features and matched with the templates in a dynamically updated template bank. PRe-MVOS (Luiten et al, 2019) uses Mask R-CNN (He et al, 2020) to generate coarse mask proposals and conducts refinement and re-identification to achieve high performance. DyeNet (Li and Loy, 2018) uses an RPN (Ren et al, 2017) to extract proposals and uses a re-identification module to connect proposals with cyclic mask propagation. In FTMU (Yang and Chan, 2018), reinforcement learning decides which matching scheme to apply to each proposal, matching based on IoU or matching based on appearance; it also decides whether to update the template used for matching. Region-proposal methods depend heavily on a pre-trained detector, and multiple thresholds usually have to be set in the pipeline.
The resulting models are complex, and most cannot be trained end to end.

Propagation-based methods
Propagation-based methods propagate information from historical frames to the target frame to assist its segmentation. DIPNet (Hu et al, 2020) decomposes the VOS task into a dynamic propagation stage and a spatial segmentation stage at each time step; in the dynamic propagation stage, a new object representation carries the reference information from the adaptively propagated object, which enhances robustness over time. DyeNet (Li and Loy, 2018) integrates template matching into a re-identification network and incorporates FlowNet (Ilg et al, 2017); mask propagation with optical-flow information and a bidirectional RNN makes its training complex. DTMNet (Zhang et al, 2020) stores the short-term and long-term video sequence information before the target frame as temporal memory in order to model temporal information. Propagation-based methods can track the temporal continuity of an object well when the object changes smoothly, but they are prone to false propagation and lack robustness to occlusion.

Matching-based methods
Matching-based methods perform pixel-by-pixel matching between the reference frames and the target frame. SSM (Zhu et al, 2021) not only captures the pixel-level similarity between the reference frame and the target frame, but also reveals the separable structure of the specified object in the target frame. CFBI encodes embedding features from both foreground and background and matches the reference frames with the target frame at the pixel level and the instance level, so it is robust to various object scales. FEELVOS (Chen et al, 2018a) proposes global and local matching according to distance values. The STM-series methods (Hu et al, 2021; Oh et al, 2019; Wang et al, 2021; Xie et al, 2021) are memory retrieval methods that apply a spatio-temporal matching mechanism between the target frame and many historical frames, and have achieved very remarkable results in semi-supervised VOS. Matching-based methods are fast and handle occlusion and object disappearance and reappearance well, but the lack of temporal continuity information also makes them prone to wrong matches.

Propagation in similar fields
In fields similar to the VOS task, memory propagation is used in different ways, and the resulting methods have achieved quite good results in their own fields, which also inspired this paper. In video tracking, RFL (Yang and Chan, 2017) adapts the RNN into a ConvLSTM that generates a filter for specific objects. MemTrack (Yang and Chan, 2018) proposes a dynamic memory network so that the template adapts to appearance changes of the object during tracking. EVS (Paul et al, 2020) propagates optical flow, features, and masks frame by frame. LERNet (Wu et al, 2020) mines rich features in key frames and computes overall attention with the subsequent non-key frames to spread consistency information across frames in real time.

Methods
This section describes the specific design of the MPM model. Firstly, the overall network structure and network processing pipeline are introduced in Section 3.1. Then, the sub-network designs of Memory Propagation module and Multi-object Matching module are introduced in Sections 3.2 and 3.3 respectively. The sub-network design of High Frequency Refine module is described in Section 3.4.

Overview
The overall network structure is shown in Fig. 2.
In this section, we refer to the target frame as the query frame. First, the reference frame and the query frame are sent to the encoder (ResNet50 (He et al, 2016) is used as our backbone) for feature extraction. We use the layer-4 output res4 for subsequent matching and propagation, where res4 ∈ ℝ^{H×W×C} and both H and W are 1/16 of the input video frame size. A 3×3 convolution is then applied to the encoder output to generate a feature embedding, Embedding ∈ ℝ^{H×W×C/2}. Then, starting from the first frame, every Embedding is fed into the ConvGRUCell; the initial hidden tensor of the ConvGRUCell is a zero tensor.
The ConvGRU then memorizes the Embeddings frame by frame up to the previous adjacent frame of the query frame. The structure of the ConvGRUCell is similar to that of the LVO (Tokmakov et al, 2017) model, and all convolutions in the ConvGRUCell are 5×5. Finally, the output of the last hidden layer and the Embeddings of the previous adjacent frame and the query frame are sent to the MP module for memory propagation.
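As a concrete illustration, a standard ConvGRU cell with 5×5 convolutions that memorizes frame embeddings can be sketched as follows. This is a minimal sketch, not the paper's exact implementation; channel sizes and spatial resolution are illustrative.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal ConvGRU cell with 5x5 convolutions (illustrative sizes)."""
    def __init__(self, in_ch, hid_ch, k=5):
        super().__init__()
        p = k // 2
        # update (z) and reset (r) gates computed jointly from input and hidden state
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        # candidate hidden state
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        zr = torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state

# memorize a short clip frame by frame; the hidden state starts as zeros
cell = ConvGRUCell(64, 64)
h = torch.zeros(1, 64, 24, 24)
frames = [torch.randn(1, 64, 24, 24) for _ in range(4)]
for f in frames:
    h = cell(f, h)
```

The final hidden state `h` plays the role of the memorized feature that is later handed to the MP module.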
In addition, the first frame, the previous adjacent frame, and the query frame each pass through a 3×3 convolution to generate a Value as the input of the MOM module. The three Values are then sent to the MOM module for matching.
We also propose the HFR module to make the network pay more attention to object edges. Its inputs are the Value of the query frame and the result of a 1×1 convolution applied to the encoder output of the query frame.
Finally, the outputs of the three modules are sent to the decoder to generate the final segmentation mask, with skip connections used to merge low-level features. The same decoder structure as in the STM (Oh et al, 2019) model is used in our MPM model. The convolution weights of the reference frames and the query frame are not shared.

Memory propagation module
The structure of the MP module is shown in Fig. 3. The basic idea of the MP module is to use the similarity between the previous adjacent frame and the query frame to guide memory propagation. We define E_P as the Embedding of the previous adjacent frame, E_Q as the Embedding of the query frame, and h as the hidden feature output by the ConvGRU. Each of the H × W pixel features in E_P and E_Q can be regarded as a C/2-dimensional feature vector. A similarity matrix between E_P and E_Q is computed from the cosine similarity of the corresponding C/2-dimensional feature vectors; an exponential e^(x) makes each entry positive, and division by the maximum value Max normalizes it to between 0 and 1. We use the Max operation instead of the Sigmoid function because we believe the relative similarity within a frame is more representative. Memory propagation is then guided by the similarity matrix.

Fig. 2 Architecture overview of MPM. Embeddings of past frames are memorized through the ConvGRU. The Memory Propagation (MP) module guides memory propagation through the similarity between the previous adjacent frame and the query frame. The Multi-object Matching (MOM) module conducts spatio-temporal retrieval and intra-frame matching guided by the first frame. The High Frequency Refine (HFR) module highlights the high-frequency information of the query frame.

The calculation of the similarity matrix and the feature propagation process are defined in Eq. (1) and Eq. (2):

S_P(i) = e^{cos(E_P(i), E_Q(i))} / Max_j e^{cos(E_P(j), E_Q(j))}   (1)

F_MP = Conv(Concat(S_P ⊙ h, (1 − S_P) ⊙ E_Q))   (2)

where Conv stands for a convolution operation, Concat stands for concatenation, i is the pixel index, cos(·,·) is cosine similarity, and ⊙ is element-wise multiplication broadcast over the channel dimension. S_P ∈ ℝ^{H×W} denotes the similarity matrix between the query frame and the previous adjacent frame, and F_MP ∈ ℝ^{H×W×C/2} is the memory propagation feature output by the MP module. The more similar the query frame is to the previous adjacent frame, the greater the weight of memory propagation; the more dissimilar, the greater the weight of memory update. The number of channels is then changed to C/2 through a 1×1 convolution, and multi-scale feature propagation is carried out through an ASPP (Chen et al, 2018a) module, whose H × W × C/2 output is used as the output of this module.
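The similarity-guided blend in Eq. (1) and Eq. (2) can be sketched as below. This is a sketch under the assumptions stated in the text (max-normalized exponentiated cosine similarity gating memory vs. query features); the fusion convolution `fuse` is a hypothetical 1×1 stand-in for the module's Conv step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mp_propagate(E_P, E_Q, h, fuse):
    """Sketch of Eq. (1)-(2): per-pixel cosine similarity between the
    previous-frame and query embeddings, exponentiated and normalized by
    the in-frame maximum, then used to weight memory h vs. query feature."""
    s = F.cosine_similarity(E_P, E_Q, dim=1)       # (B, H, W)
    s = torch.exp(s)
    s = s / s.amax(dim=(1, 2), keepdim=True)       # normalize by in-frame max
    s = s.unsqueeze(1)                             # (B, 1, H, W), broadcast over C
    fused = torch.cat([s * h, (1 - s) * E_Q], dim=1)
    return fuse(fused)

# toy usage with a hypothetical 1x1 fusion convolution
fuse = nn.Conv2d(64, 32, 1)
E_P, E_Q, h = (torch.randn(1, 32, 8, 8) for _ in range(3))
out = mp_propagate(E_P, E_Q, h, fuse)
```

Note how a pixel highly similar to the previous frame draws mostly from the propagated memory `h`, while a dissimilar pixel falls back to its own query embedding.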

Multi-object matching module
The MOM module proposed in this paper is divided into two parts: Spatio-temporal Retrieval (STR) and the calculation of the Multi-object Matching Probability (MOMP). We discuss them in Sections 3.3.1 and 3.3.2 respectively.

Spatio-temporal retrieval
Inspired by STM (Oh et al, 2019), we design the STR module as shown on the right side of Fig. 4, but use only the first frame and the previous adjacent frame as the memory. We define V_F as the Value of the first frame, V_P as the Value of the previous adjacent frame, and V_Q as the Value of the query frame. First, V_Q is matched against the concatenation of V_F and V_P to obtain the spatio-temporal similarity, and a Softmax is performed along the memory dimension. Memory retrieval is then carried out according to the similarity matrix, and the retrieved features are output. The whole process is defined in Eq. (3) and Eq. (4):

S_T(j, i) = exp(V_M(j) · V_Q(i)) / Σ_j exp(V_M(j) · V_Q(i))   (3)

F_STR(i) = Σ_j S_T(j, i) V_M(j)   (4)

where i and j are pixel indices, V_M ∈ ℝ^{T×H×W×C/2} (with T = 2) denotes the memory feature obtained by concatenating V_F and V_P, · denotes the inner product, S_T ∈ ℝ^{THW×HW} is the similarity matrix between the query frame and the memory, and F_STR ∈ ℝ^{H×W×C/2} is the retrieval feature output by the STR module.
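The memory read of Eq. (3) and Eq. (4) is a standard attention-style retrieval, which can be sketched as follows. This is a minimal single-example sketch (no batch dimension); the shapes are illustrative.

```python
import torch

def str_read(V_M, V_Q):
    """Sketch of Eq. (3)-(4): dot-product similarity between every memory
    pixel (T*H*W of them) and every query pixel, Softmax over the memory
    dimension, then a weighted sum of memory features.
    V_M: (T, C, H, W) memory values; V_Q: (C, H, W) query value."""
    T, C, H, W = V_M.shape
    m = V_M.permute(1, 0, 2, 3).reshape(C, T * H * W)  # flatten memory pixels
    q = V_Q.reshape(C, H * W)
    S = torch.softmax(m.t() @ q, dim=0)                # (THW, HW), Softmax over memory
    out = m @ S                                        # retrieved features, (C, HW)
    return out.reshape(C, H, W)

# toy usage: memory of T=2 frames (first frame + previous adjacent frame)
V_M = torch.randn(2, 16, 6, 6)
V_Q = torch.randn(16, 6, 6)
F_STR = str_read(V_M, V_Q)
```

Each query pixel thus receives a convex combination of all memory-pixel features, weighted by similarity.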

Multi-object matching probability
The MOMP proposed in this paper first guides intra-frame matching according to the inter-frame matching between the first frame and the query frame, and finally calculates, from the intra-frame matching, the probability that each pixel in the Value of the query frame belongs to each of the objects. The first frame is chosen for inter-frame matching because its mask contains accurate object information. The whole process of the Multi-object Matching Probability is shown on the left side of Fig. 4. Inter-frame matching proceeds as in Eq. (5):

S_M(j) = Σ_i V_F^fg(i) · V_Q(j)   (5)

where i and j are pixel indices and S_M(j), with S_M ∈ ℝ^{H×W}, is the sum of the similarities between a pixel of the query frame and all foreground pixels of the first frame. According to the ground-truth mask of the first frame, the foreground features V_F^fg of the first frame are filtered out, the similarity matrix is calculated, and the summation is obtained. Then the K query-frame feature vectors with the greatest similarity to the foreground pixels of the first frame are selected from V_Q according to S_M, as Eq. (6) shows:

V_Q^topk = TopK(V_Q; S_M, K)   (6)

V_Q^topk contains the selected K pixel features most similar to the foreground pixels of the first frame, and its size is K × C.
Similar to inter-frame matching, the K key vectors V_Q^topk are matched against V_Q for intra-frame matching, and the resulting similarity matrix Prob is averaged along the K dimension. Each object is processed in this way, and soft aggregation (Oh et al, 2019) is applied to the resulting Prob matrices to calculate the background probability. The probability of each pixel belonging to each category (including background) is then computed with Softmax. Finally, the probabilities of the objects are taken as the output (the background probability only helps to compute the per-object probabilities and is not used afterwards). At last, the multi-object probability Prob and the Value of the query frame V_Q are multiplied as a spatial attention operation, giving the output feature F_MOMP ∈ ℝ^{H×W×C/2}.
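The per-object matching chain of Eq. (5) and Eq. (6) followed by intra-frame matching can be sketched as below for a single object. This is a sketch under stated assumptions: similarity is a plain dot product, and the function and variable names (`momp`, `V_F`, `fg_mask`) are illustrative, not the paper's exact interface.

```python
import torch

def momp(V_F, V_Q, fg_mask, k=32):
    """Sketch for one object: sum similarities between query pixels and the
    first frame's foreground pixels (Eq. 5), pick the top-k query pixels
    (Eq. 6), then match them back against all query pixels and average
    over k to get an intra-frame matching score per pixel.
    V_F, V_Q: (C, H, W); fg_mask: (H, W) boolean first-frame mask."""
    C, H, W = V_Q.shape
    q = V_Q.reshape(C, H * W)
    fg = V_F.reshape(C, H * W)[:, fg_mask.reshape(-1)]  # foreground features, (C, N_fg)
    S_M = (fg.t() @ q).sum(dim=0)                       # (HW,) summed similarity, Eq. (5)
    idx = S_M.topk(k).indices                           # Eq. (6)
    q_topk = q[:, idx]                                  # (C, k) most object-like pixels
    prob = (q_topk.t() @ q).mean(dim=0)                 # intra-frame score, averaged over k
    return prob.reshape(H, W)

# toy usage with a small square foreground region
V_F, V_Q = torch.randn(16, 8, 8), torch.randn(16, 8, 8)
fg_mask = torch.zeros(8, 8, dtype=torch.bool)
fg_mask[2:5, 2:5] = True
prob = momp(V_F, V_Q, fg_mask, k=8)
```

Running this per object and stacking the resulting maps gives the Prob tensor that soft aggregation and Softmax then turn into per-pixel class probabilities.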

High frequency refine module
The HFR module is shown in Fig. 5. Inspired by prior work, this module attends to the high-frequency information of the query frame to improve segmentation quality. We apply a 3×3 convolution and a 1×1 convolution respectively to the encoder output of the query frame, where the result of the 3×3 convolution is the Embedding of the query frame shown in Fig. 1. The processing of this module is given in Eq. (9):

F_HFR = Sigmoid(Conv_1×1(X) − Conv_3×3(X)) ⊙ Conv_1×1(X)   (9)

We subtract the result of the 3×3 convolution from the result of the 1×1 convolution, then apply the Sigmoid function to highlight the edge information of the query frame. The result is multiplied by the 1×1-convolved feature map to obtain the refined feature F_HFR ∈ ℝ^{H×W×C/2}.
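The subtraction-and-gate operation of Eq. (9) can be sketched as a small module. This is a minimal sketch with illustrative channel counts; the intuition is that the 3×3 branch acts as a local smoother, so the difference from the 1×1 branch emphasizes high-frequency (edge) responses.

```python
import torch
import torch.nn as nn

class HFR(nn.Module):
    """Sketch of Eq. (9): gate the 1x1 projection of the encoder feature by
    the sigmoid of its difference from a smoothed 3x3 projection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1)
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        a = self.conv1(x)
        edge = torch.sigmoid(a - self.conv3(x))  # high-frequency attention map
        return edge * a

# toy usage on an encoder-like feature map
hfr = HFR(256, 128)
x = torch.randn(1, 256, 16, 16)
f = hfr(x)
```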

Experiments
In this section, the datasets and evaluation metrics are introduced in Section 4.1, the implementation details of the experiments in Section 4.2, and the ablation study, which illustrates the contribution of the different modules proposed in MPM, in Section 4.3. In Section 4.4, we report the evaluation results on the benchmarks.

DAVIS. The DAVIS datasets provide densely annotated frames, each containing one or more objects. Following the DAVIS evaluation standard, this paper uses the average J index, the average boundary F score, and the average J&F to evaluate segmentation accuracy. The J score is the average Intersection over Union (IoU) between the prediction and the ground-truth mask. The F score is the average boundary similarity between the boundary of the prediction and that of the ground-truth mask, and J&F is the average of J and F. In addition, frames per second (FPS) is used to measure segmentation speed.

YouTube-VOS. YouTube-VOS (Xu et al, 2018) is a large video object segmentation dataset, including 4453 videos with multiple object annotations. Its validation set has 474 sequences covering 91 object classes, 26 of which are not seen in the training set. On YouTube-VOS, this paper reports the overall J&F accuracy, which averages the indices over object classes seen and unseen in the training set.

Training and inference
In this section, we explain training and inference. The multi-object segmentation method is described in Section 4.2.1, and the training strategy is introduced in Section 4.2.2.

Multi-object segmentation method
According to the disjoint constraint on multiple objects, that is, each pixel can belong to only one object, this paper builds the MOM module and adopts the widely used Softmax classifier for classification. In fact, the network processes each object individually, but in the MOM module Softmax is used to calculate the probability that each pixel belongs to each object, and soft aggregation is used to merge the segmentation results of multiple objects, similar to Oh et al (2019). For the training loss, we compute the multi-class cross-entropy loss instead of a two-class loss as in Zhu et al (2021). In the VOS task, the number of objects varies across video sequences; this paper uses the batching capability of the PyTorch deep learning framework to handle this. All training is implemented on one NVIDIA GeForce RTX 2080 Ti GPU.
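The soft aggregation merge described above can be sketched as follows. This is a sketch of the STM-style scheme the text refers to, under the assumption that the background probability is the product of the per-object complements and that per-class odds p/(1−p) are normalized; it is not claimed to be MPM's exact implementation.

```python
import torch

def soft_aggregate(probs, eps=1e-7):
    """Sketch of STM-style soft aggregation: merge per-object foreground
    probability maps (N, H, W) into a distribution over background plus
    N objects by normalizing the odds p / (1 - p) of each class."""
    bg = torch.prod(1 - probs, dim=0, keepdim=True)       # background, (1, H, W)
    p = torch.cat([bg, probs], dim=0).clamp(eps, 1 - eps)
    odds = p / (1 - p)
    return odds / odds.sum(dim=0, keepdim=True)           # (N+1, H, W), sums to 1

# toy usage: three overlapping per-object probability maps
probs = torch.rand(3, 4, 4)
merged = soft_aggregate(probs)
```

The merged tensor is a valid per-pixel distribution over background and objects, enforcing the disjoint constraint softly.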
Testing is also carried out on one NVIDIA GeForce RTX 2080 Ti GPU.

Pre-training on static images. Following SwiftNet (Wang et al, 2021), we first pre-train the MPM network on 4-frame pseudo video sequences generated from the MS-COCO (Lin et al, 2014) dataset. In the pre-training stage, the input image size is set to 384×384. We use the Adam optimizer with a learning rate starting from 5e-5, adjusted by polynomial scheduling. During training, all batch-normalization layers in the backbone are frozen at their ImageNet pre-training values. The batch size is 4, realized on one GPU by manual accumulation. Pseudo video sequences are generated from the image dataset by randomly extracting the foreground object from a static image and pasting it onto a randomly sampled background image to form a new image; affine transformations such as rotation, resizing, cropping, and translation are applied to the foreground and background to generate deformation and occlusion respectively. MPM performs 250K iterations with pseudo video sequences. After pre-training, J&F reaches 75.2% on the DAVIS 2017 validation set, which demonstrates the effectiveness of the pre-training.
Fine-tuning on real video sequences. After pre-training, we train for 450K iterations on DAVIS 2017 and YouTube-VOS. In each iteration, we randomly sample 4 consecutive frames (with at most 4 randomly skipped frames between samples) and estimate the segmentation mask frame by frame. From the second frame on, the segmentation mask of the previous frame (MPM's output) is fed to the network for the segmentation of subsequent frames. At the beginning of training, the maximum number of randomly skipped frames is 4 on the DAVIS dataset; due to fast motion, it is 2 on the YouTube-VOS dataset. Every 20K iterations, the maximum number of random skips is reduced by 1 until it reaches 0. We adopt this decreasing-skip schedule because a large skip number in the first 20K iterations is conducive to training the MOM module, while keeping the skip number at 0 afterwards ensures the continuity of the 4 sampled frames, which is conducive to training the MP module.

Ablation study
In this section, we show the results of the ablation study. The ablation study is divided into two parts. The first part is the ablation study about hyperparameter K in the calculation of the Multi-object Matching Probability. The second part is to verify the effectiveness of the sub-module in the whole MPM model. All ablation studies were validated on the DAVIS 2017 Validation dataset.
Table 1 Ablation study for K in the Multi-object Matching Probability on the DAVIS 2017 validation set.

Hyperparameter. Table 1 shows the results for hyperparameter K. When K is 8, 16, and 48, the J&F of MPM reaches 81.4%, 81.9%, and 82.4% respectively. When K is 32, the J&F of MPM is optimal at 82.8%. Due to memory constraints, we did not run ablations with K greater than 48. We take K = 32 for our final MPM model.
Network Sub-modules. Table 2 shows the ablation results for the sub-modules of MPM. When the Memory Propagation (MP) module is removed, J&F drops to 80.7% (−2.1%). When the Multi-object Matching (MOM) module is removed, J&F degrades to 77.2%, a decrease of 5.6% compared with the full MPM. When the High Frequency Refine (HFR) module is removed, J&F drops to 82.6% (−0.2%). Our MOM module consists of two parts, the Multi-object Matching Probability (MOMP) and Spatio-temporal Retrieval (STR); to verify the effectiveness of MOMP, we also ablated it, and removing MOMP gives 82.5% (−0.3%). Since STR provides parameter updates for MOMP during training, we did not run an experiment removing STR alone. Figure 6 shows some visualization results of removing each module. The camel and loading sequences show that removing the MP and HFR modules leads to a large number of wrong matches, and the loading sequence also shows that removing the MOM module easily leads to incomplete segmentation results.
This ablation study verifies that the Multi-object Matching module contributes greatly to our MPM model, and that the other modules also contribute to the final segmentation results.

Evaluations on benchmarks
DAVIS 2017. As shown in Table 3, our MPM achieves 82.8% on the DAVIS 2017 validation set, the same as the previous state-of-the-art model GraphMem (Lu et al, 2020), and achieves leading performance compared with other recent methods. STM (Oh et al, 2019) is a memory retrieval method that needs to sample historical frames to construct its memory, so its memory occupation grows and its segmentation slows as segmentation progresses. The segmentation time and memory occupation of our method do not increase over time, and we still exceed STM by 1% in J&F. AFB-URR (Liang et al, 2020), aiming at the shortcomings of STM, proposes an adaptive feature bank that dynamically absorbs new features and discards obsolete ones; trained without YouTube-VOS sequences, its J&F reaches 74.6%. Our MPM model trained without YouTube-VOS data reaches 78.0%, outperforming AFB-URR by 3.4%. GC also addresses the deficiencies of STM, proposing a fixed-size feature representation to replace the use of many previous frames in STM. Although its segmentation is fast, its J&F does not exceed our result without YouTube-VOS training (71.4% vs. 78.0%). We also report results on the DAVIS 2017 test-dev set, where our J&F reaches the best result of 75.2%, significantly exceeding STM (+3%).
In short, the quantitative results show that our method achieves competitive results on DAVIS 2017.

DAVIS 2016. As shown in Table 4, since DAVIS 2016 is a single-object dataset, evaluation on it is not affected by interactions between multiple objects, so performance depends highly on the accuracy of segmentation details. On this dataset our method also achieves the best results. We also report frames per second (FPS) on this dataset; for fair comparison with other methods, the FPS results are run on one Tesla P100 GPU. Although our FPS is lower than those of RANet (Wang et al, 2019), GC, and A-GAME (Johnander et al, 2019), our J&F (without YouTube-VOS training) exceeds RANet by 2%, GC by 0.9%, and A-GAME by 5.4%, and exceeds STM by 1% without YouTube-VOS training. With YouTube-VOS added for training, MPM matches STM's J&F with faster segmentation, which proves that our method achieves competitive results in both accuracy and speed.

YouTube-VOS. As shown in Table 5, our MPM model also achieves quite good results on YouTube-VOS, with an overall J&F of 80.1%. Compared with GraphMem, our J&F is 0.1% lower; however, comparing the indices on seen and unseen objects, although our MPM has lower J and F on seen objects than GraphMem, it is 0.4% higher in J and 0.7% higher in F on unseen objects, which demonstrates the good generalization of our method. In addition, our J&F exceeds AFB-URR by 0.5% and STM by 0.7%. YouTube-VOS is a large-scale video dataset with many kinds of videos.
The results obtained by MPM are enough to prove the superiority of this method.

Qualitative results. Figure 7 shows some qualitative results on DAVIS 2017 and YouTube-VOS.
The first two rows in the figure compare MPM with RGMP (Oh et al, 2018). RGMP is a propagation-based method; when an object moves rapidly, its segmentation is easily incomplete, while our method obtains better results in this case. STM and GraphMem are matching-based methods; compared with ours, their results show false matching of objects in the background and confusion among the segmentation results of multiple objects, while our method reduces such false matches.

Conclusion
In this paper, the task of semi-supervised video object segmentation is studied, and a network based on memory propagation and matching (MPM) is proposed, which combines the propagation-based and matching-based methods. The Memory Propagation (MP) module is proposed to propagate temporal continuity information, and the Multi-object Matching (MOM) module is proposed to address occlusion and object disappearance and reappearance. In addition, a High Frequency Refine (HFR) module refines object edges to further improve the segmentation results. In short, the method achieves competitive performance on VOS benchmarks.