
Fig. 2. Categories of previous deep SOD models. (a) MLP-based methods; (b)-(f) FCN-based methods, mainly using (b) single-stream network, (c) multi-stream network, (d) side-out fusion network, (e) bottom-up/top-down network, and (f) branch network architectures. (g) Hybrid network-based methods. See §2.1 for more detailed descriptions.

trained image classification network. The lateral inhibition enhances the discriminative ability of the attention maps, removing the need for SOD annotations.

• RSDNet-R [90] combines an initial coarse representation with finer features from earlier layers under a gating mechanism to refine the side-outputs stage by stage. Maps from all the stages are fused to obtain the overall saliency map.

2) Multi-stream network, as depicted in Fig. 2 (c), typically has multiple network streams, each of which is trained with an input at a particular resolution to explicitly learn multi-scale saliency features. The outputs from the different streams are then combined for the final prediction.

4) Bottom-up/top-down network refines the rough saliency estimate from the feed-forward pass by progressively incorporating spatial-detail-rich features from lower layers, and produces the final map at the top-most layer (see Fig. 2 (e)).

• DHSNet [37] refines the coarse saliency map by gradually combining shallower features using recurrent layers, where all the intermediate maps are supervised by the ground truth saliency maps [100].

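The data flow of the multi-stream and bottom-up/top-down designs can be illustrated with a toy sketch in plain Python. It is not the implementation of any cited model. The function names, the averaging fusion, and the fixed blending weight `alpha` are all assumptions made for illustration. In particular, the simple blend stands in for the learned recurrent layers of DHSNet and the gating of RSDNet-R, and "features" are just 2D lists of floats.

```python
from typing import List

Map = List[List[float]]  # a 2D saliency or feature map, values in [0, 1]


def downsample2x(m: Map) -> Map:
    """Average-pool a map by a factor of 2 (assumes even height and width)."""
    h, w = len(m), len(m[0])
    return [[(m[y][x] + m[y][x + 1] + m[y + 1][x] + m[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample2x(m: Map) -> Map:
    """Nearest-neighbour upsampling by a factor of 2."""
    h, w = len(m), len(m[0])
    return [[m[y // 2][x // 2] for x in range(2 * w)] for y in range(2 * h)]


def fuse_streams(maps: List[Map]) -> Map:
    """Multi-stream fusion (hypothetical): average same-sized stream outputs."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(w)]
            for y in range(h)]


def top_down_refine(coarse: Map, shallow_feats: List[Map],
                    alpha: float = 0.5) -> List[Map]:
    """Refine a coarse map with progressively shallower, finer features.

    shallow_feats is ordered from deepest to shallowest; each entry must have
    twice the resolution of the map produced before it. The fixed blend with
    weight alpha is an assumed stand-in for a learned refinement unit.
    Returns every intermediate map, so each one could be supervised.
    """
    outputs = [coarse]
    current = coarse
    for feat in shallow_feats:
        up = upsample2x(current)
        current = [[(1.0 - alpha) * up[y][x] + alpha * feat[y][x]
                    for x in range(len(feat[0]))]
                   for y in range(len(feat))]
        outputs.append(current)
    return outputs


if __name__ == "__main__":
    # Toy "image": salient right half. Each stream here is an identity
    # "network" applied at a different input resolution.
    image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    full_res_stream = image
    half_res_stream = upsample2x(downsample2x(image))
    print(fuse_streams([full_res_stream, half_res_stream]))

    # Top-down pass: start from a 1x1 coarse estimate and add detail.
    stages = top_down_refine([[0.5]], [downsample2x(image), image])
    for stage in stages:
        print(stage)
```

Returning the full list of stage outputs mirrors the deep-supervision idea mentioned above: in a trained model, a loss could be attached to each intermediate map and not only to the final one.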