Figure 2 – uploaded by Qiuxia Lai


Fig. 2. Category of previous deep SOD models. (a) MLP-based methods; (b)-(f) FCN-based methods, mainly using (b) single-stream network, (c) multi-stream network, (d) side-out fusion network, (e) bottom-up/top-down network, and (f) branch network architectures. (g) Hybrid network-based methods. See §2.1 for more detailed descriptions.

... trained image classification network. The lateral inhibition enhances the discriminative ability of the attention maps, releasing it from the need for SOD annotations.

• RSDNet-R [90] combines an initial coarse representation with finer features at earlier layers under a gating mechanism to refine the side-outputs stage by stage. Maps from all the stages are fused to obtain the overall saliency map.

2) Multi-stream network, as depicted in Fig. 2 (c), typically has multiple network streams, each of which is trained with an input at a particular resolution to explicitly learn multi-scale saliency features. The outputs from different streams are then combined for the final prediction.

4) Bottom-up/top-down network refines the rough saliency estimation in the feed-forward pass by progressively incorporating spatial-detail-rich features from lower layers, and produces the final map at the top-most layer (see Fig. 2 (e)).

• DHSNet [37] refines the coarse saliency map by gradually combining shallower features using recurrent layers, where all the intermediate maps are supervised by the ground truth saliency maps [100].

Related Figures (13)

Fig. 1. A brief chronology of salient object detection (SOD). The very first SOD models date back to the work of Liu et al. [29] and Achanta et al. [30]. The first incorporation of deep learning techniques in SOD models is from 2015. See §1.1 for more detailed descriptions.

Summary of previous reviews. See §1.2 for more detailed descriptions.

1.2 Related Previous Reviews and Surveys

With the compelling success of deep learning technologies in computer vision, more and more deep learning-based SOD methods have emerged since 2015. Earlier deep SOD models typically use multi-layer perceptron (MLP) classifiers to predict the saliency score of deep features extracted from each image processing unit [26]-[28]. Later, a more effective and efficient form, i.e., the fully convolutional network (FCN)-based model, became the mainstream SOD architecture. Different deep models have different levels of supervision, and may use different learning paradigms during training. In particular, some SOD methods further distinguish individual instances among all the detected salient objects [36], [55]. A brief chronology is shown in Fig. 1.

Benchmarking results of 29 state-of-the-art deep SOD models and 3 top-performing classic SOD methods on 6 famous datasets (see §5.1). * Non-deep learning model. † Weakly-supervised model. ° Bounding-box output. ‡ Training on subset. - Results not available.

TABLE 5 Descriptions of attributes. See §5.2 for more details.

• Structural measure (S-measure) [128], different from the above metrics which only address pixel-wise errors, evaluates structural similarity between the real-valued saliency map and the binary ground-truth. S-measure (S) considers two terms, S_o and S_r, referring to object-aware and region-aware structure similarities, respectively.
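The S-measure description above breaks off before its equation on this page. As a hedged reconstruction that follows the usual definition of the structure measure in the literature (not text recovered from this page), the two terms are combined as a weighted sum, where the balance weight α is an assumption here and is commonly set to 0.5:

```latex
S = \alpha \cdot S_o + (1 - \alpha) \cdot S_r, \qquad \alpha \in [0, 1]
```

With α = 0.5 the object-aware and region-aware similarities contribute equally to the final score.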

5 BENCHMARKING AND ANALYSIS

5.1 Overall Performance Benchmarking Results

Table 4 shows the performance of 29 state-of-the-art deep SOD models and 3 top-performing classic SOD methods on 6 popular datasets widely used in SOD research. Three evaluation metrics, i.e., maximal Fβ [30], S-measure [128] and MAE [32], are used for assessing pixel-wise saliency prediction accuracy and the structure similarity of salient regions. All 32 benchmarked models are representative, and have publicly available implementations or saliency prediction results on the 6 selected datasets.
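To make two of the metrics named above concrete, here is a minimal pure-Python sketch of MAE and maximal Fβ on flattened maps. It is an illustration under stated assumptions, not the benchmark's evaluation code: the function names are hypothetical, β² = 0.3 is the value customary in SOD work rather than something stated on this page, and the 256 evenly spaced thresholds are likewise an assumption.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary mask,
    both given as flat lists of values in [0, 1]."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_beta(pred, gt, threshold, beta_sq=0.3):
    """F-measure of the map binarized at `threshold`.
    beta_sq = 0.3 is the customary SOD setting (an assumption here)."""
    binary = [p >= threshold for p in pred]
    tp = sum(1 for b, g in zip(binary, gt) if b and g >= 0.5)
    if tp == 0:
        return 0.0  # also avoids division by zero below
    precision = tp / sum(binary)
    recall = tp / sum(1 for g in gt if g >= 0.5)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def max_f_beta(pred, gt, levels=256):
    """Maximum F-measure over `levels` evenly spaced thresholds in [0, 1]."""
    return max(f_beta(pred, gt, t / (levels - 1)) for t in range(levels))


if __name__ == "__main__":
    pred = [0.9, 0.8, 0.2, 0.1]  # toy saliency values, not real results
    gt = [1, 1, 0, 0]            # toy binary ground truth
    print(mae(pred, gt), max_f_beta(pred, gt))
```

Taking the maximum over thresholds removes the dependence on any single binarization level, which is why the benchmark tables report max F rather than F at a fixed threshold.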

Fig. 4. Sample images from the hybrid benchmark consisting of images randomly selected from 6 SOD datasets. Salient regions are uniformly highlighted. Corresponding attributes are listed. See §5.2 for more detailed descriptions.

Attribute-based study w.r.t. salient object categories, challenges and scene categories. (-) indicates the percentage of the images with a specific attribute. ND-avg indicates the average score of three top-performing heuristic models: HS [34], DRFI [48] and wCtr [35]. D-avg indicates the average score of three top-performing deep learning models: DGRL [88], PAGR [89] and PiCANet [39]. All three models are trained on DUTS [73]. (Best in red, worst with underline; see §5.2 for details.) * Non-deep learning model.

Attribute statistics of top and bottom 100 images based on F-measure. (-) indicates the percentage of the images with a specific attribute. ND-avg indicates the average results of three top-performing heuristic models: HS [34], DRFI [48] and wCtr [35]. D-avg indicates the average results of three top-performing deep models: DGRL [88], PAGR [89] and PiCANet [39]. (The two largest changes are shown in red if positive, blue if negative; see §5.2.)

The experimented input perturbations include Gaussian blur, Gaussian noise, Rotation, and Gray. More specifically, to study the effects of different degrees of blurring, we blur the images using Gaussian kernels with sigma set to 2 or 4. For noise, we select two variance values, i.e., 0.01 and 0.08, covering both tiny and medium magnitudes. For rotation, we rotate the images by +15° and −15°, respectively, and cut out the largest box with the original aspect ratio. The gray images are generated using the Matlab rgb2gray function.

Input perturbation study on the hybrid benchmark (§5.2). Perturbations include Gaussian blur, Gaussian noise, Rotation and Gray. ND-avg indicates the average score of three top-performing heuristic models: HS [34], DRFI [48] and wCtr [35]. D-avg indicates the average score of three representative deep learning models: SRM [81], DGRL [88] and PiCANet [39]. See §5.3 for details. (Best in red, worst with underline).
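The perturbation settings described above (blur with sigma 2 or 4, noise variance 0.01 or 0.08, rotation by ±15° with a same-aspect-ratio crop, and grayscale conversion) can be sketched in pure Python. This is a minimal illustration, not the authors' pipeline: all function names are hypothetical, intensities are assumed to lie in [0, 1], the 3σ kernel radius is an assumption, and the crop is assumed to be centered. The gray weights are the ones documented for MATLAB's rgb2gray.

```python
import math
import random

# Luma weights documented for MATLAB's rgb2gray.
GRAY_WEIGHTS = (0.2989, 0.5870, 0.1140)


def rgb_to_gray(pixel):
    """Weighted sum of one (r, g, b) pixel with channels in [0, 1]."""
    return sum(w * c for w, c in zip(GRAY_WEIGHTS, pixel))


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; the 3*sigma radius is an assumption."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def add_gaussian_noise(values, variance, rng):
    """Add zero-mean Gaussian noise to intensities in [0, 1], then clip."""
    std = math.sqrt(variance)
    return [min(1.0, max(0.0, v + rng.gauss(0.0, std))) for v in values]


def crop_scale_after_rotation(width, height, angle_deg):
    """Scale s such that the centered (s*width) x (s*height) box is the
    largest box with the original aspect ratio that fits inside the image
    after rotating it by angle_deg about its center."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return min(width / (width * c + height * s),
               height / (width * s + height * c))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    for sigma in (2, 4):
        print("blur kernel taps:", len(gaussian_kernel_1d(sigma)))
    for variance in (0.01, 0.08):
        print(add_gaussian_noise([0.5, 0.5], variance, rng))
    for angle in (15, -15):
        print("crop scale:", crop_scale_after_rotation(400, 300, angle))
```

For a square image rotated by ±15°, the crop scale works out to √(2/3) ≈ 0.816, so roughly a third of the pixels are discarded. The crop therefore changes the visible content as well as the orientation, which is worth keeping in mind when reading the rotation results.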

Fig. 5. Examples of saliency prediction under various input perturbations. The max F values are denoted using red. See §5.3 for more details.

Fig. 6. Adversarial examples for saliency prediction under adversarial perturbations of different target networks. Adversarial perturbations are magnified by 10 for better visualization. The max F values are denoted using red. See §5.4 for more details.

Results for adversarial attack experiments. max F ↑ on the hybrid benchmark is presented when exerting adversarial perturbations from different models. See §5.4 for details. (Worst with underline)

5.4.2 Transferability across Networks

Fig. 7. Network architecture of the SOD model used in cross-dataset generalization evaluation. See §5.5 for more detailed descriptions.

Results for cross-dataset generalization experiment. max F ↑ for saliency prediction when training on one dataset (rows) and testing on another (columns), i.e., each row is: training on one dataset and testing on all the datasets. “Self” refers to training and testing on the same dataset (same as diagonal). “Mean Others” indicates average performance on all except self. See §5.5 for details.
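The "Self" and "Mean Others" columns described in the caption above can be derived from the train/test score matrix as follows. This is a small sketch with a hypothetical function name, and the numbers in the usage example are made up for illustration, not results from the table.

```python
def self_and_mean_others(scores):
    """Given a square matrix where scores[i][j] is the score obtained when
    training on dataset i and testing on dataset j, return two lists:
    the diagonal ("Self") and the per-row mean of the off-diagonal
    entries ("Mean Others")."""
    n = len(scores)
    self_scores = [scores[i][i] for i in range(n)]
    mean_others = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    return self_scores, mean_others


if __name__ == "__main__":
    # Made-up max F values for three hypothetical datasets.
    toy = [
        [0.90, 0.80, 0.70],
        [0.60, 0.85, 0.50],
        [0.40, 0.30, 0.95],
    ]
    self_scores, mean_others = self_and_mean_others(toy)
    for i, (s, m) in enumerate(zip(self_scores, mean_others)):
        print(f"train on dataset {i}: self={s:.2f}, mean others={m:.2f}")
```

A large gap between "Self" and "Mean Others" in a row suggests that a model trained on that dataset fits it well but transfers poorly to the others.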

Fig. 8. Examples for annotation inconsistency. Each row shows two exemplar image pairs. See §6.2 for more detailed descriptions.

The first improvement of SOD annotation quality is to replace the bounding-boxes with pixel-wise masks for denoting the salient objects [30], [121], which greatly boosts the performance of SOD models. In view of this, almost all the modern SOD datasets have been annotated with pixel-level labels. However, the labeling precision may differ across samples. For example, the precisions for the bicycles in Fig. 8 are obviously different. There has been no comprehensive study of the relation between label quality and model performance for SOD. A similar study regarding pixel-level labeling quality in semantic segmentation [150] has shown that a large number of coarse-labeled data can reach the performance of a smaller number of fine-labeled data, and that pre-training with coarse labels then fine-tuning with a small number of fine labels is competitive with training with a large number of fine labels. Though some works have shown the importance of high-quality labels [120], [151], more in-depth study is needed for SOD model training and dataset construction.
