Figure 11 – uploaded by Fabio Cuzzolin


In this section, we further evaluate the intersection probability on corrupted samples using the negative log-likelihood (NLL) metric. A smaller NLL indicates that the model is more confident and accurate in predicting the correct class for each input (Dusenberry et al., 2020). Figures 11 and 12 show the consistent superiority of the intersection probability on corrupted data in extensive test cases, as evidenced by smaller NLL values.

Figure 11: NLL values of BNNR, BNNF, and DE on CIFAR10-C against increased corruption intensity, using the averaged probability (Avg. Prob.) and our proposed intersection probability (Int. Prob.). VGG16 and ResNet-18 are backbones. Results are from 15 runs.
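The NLL metric described above can be illustrated with a minimal sketch. It assumes the usual definition (mean negative log of the probability assigned to the true class); the function name, the clipping constant `eps`, and the toy predictions are hypothetical and do not come from the paper.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean NLL of the true class under the predicted distributions.

    probs  : (n_samples, n_classes) predicted class probabilities
    labels : (n_samples,) integer class indices
    eps    : hypothetical clipping constant that avoids log(0)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Probability that each prediction assigns to its true class.
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

# Toy predictions (illustrative values only): both are correct,
# but the first set is more confident in the true class.
labels = [0, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

nll_confident = negative_log_likelihood(confident, labels)
nll_hesitant = negative_log_likelihood(hesitant, labels)
print(nll_confident, nll_hesitant)
```

In this toy case the more confident correct predictions yield the smaller NLL, which matches the reading of the metric given in the text.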

Related Figures (25)

Inspired by the use of probability intervals for decision-making (Yager & Kreinovich, 1999; Guo & Tanaka, 2010), we propose to build probability intervals by extracting the upper and lower bound per class from the given set of limited (categorical) probability distributions, validating this choice via extensive experiments in Section 4. For example, consider again the task of predicting weather conditions (rainy, sunny, or cloudy). When receiving three probability values for the rainy condition, e.g., 0.2, 0.1, and 0.7, using probability intervals we model the uncertainty on the probability of the rainy condition as [0.1, 0.7]. Each probability interval system can determine a convex set of probabilities over the set of classes, i.e., a credal set. Such a credal set is a more natural model than individual distributions for representing the epistemic uncertainty encoded by the prediction, as it amounts to constraints on the unknown exact distribution (Hüllermeier & Waegeman, 2021; Shaker & Hüllermeier, 2021; Sale et al., 2023a). Nevertheless, a single predictive distribution, termed intersection probability, can still be derived from a credal set to generate a unique class prediction for classification purposes. Our credal wrapper framework is depicted in Figure 1. The remainder of this section discusses the credal wrapper generation, a method for computational complexity reduction for uncertainty estimation, and the intersection probability, in this order.

Figure 1: Credal wrapper framework for a three-class (A, B, D) classification task. Given a set of individual probability distributions (denoted as single dots) in the simplex (triangle) of probability distributions of the classes, probability intervals (parallel lines) are derived by extracting the upper and lower probability bounds per class, using eq. (5). Such lower and upper probability intervals induce a credal set on {A, B, D} (P, light blue convex hull in the triangle). A single intersection probability (the red dot) is computed from the credal set using the transform in eq. (5). Uncertainty is estimated in the mathematical framework of credal sets in eq. (4).
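The interval extraction and the intersection probability can be sketched as follows. This is a minimal illustration, not the authors' implementation. The rainy-class values 0.2, 0.1, and 0.7 come from the text, while the sunny and cloudy values are made up to complete the distributions. The intersection-probability formula is not reproduced in this excerpt, so the code assumes the standard form for probability intervals: each class receives its lower bound plus a common fraction `alpha` of its interval width, with `alpha` chosen so that the result sums to one.

```python
import numpy as np

def probability_intervals(dists):
    """Per-class lower and upper bounds over a set of distributions."""
    dists = np.asarray(dists, dtype=float)
    return dists.min(axis=0), dists.max(axis=0)

def intersection_probability(lower, upper):
    """Assumed form: p = lower + alpha * (upper - lower), where alpha
    is the fraction that makes p sum to one."""
    width = upper - lower
    total_width = width.sum()
    if total_width == 0.0:
        # All distributions coincide, so the lower bound is already
        # a probability distribution.
        return lower.copy()
    alpha = (1.0 - lower.sum()) / total_width
    return lower + alpha * width

# Classes: (rainy, sunny, cloudy). Only the rainy column is taken
# from the text; the other entries are hypothetical.
predictions = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.7, 0.2, 0.1],
]

lower, upper = probability_intervals(predictions)
p_int = intersection_probability(lower, upper)
print(lower, upper, p_int)
```

For the rainy class the interval is [0.1, 0.7], as in the text. Under the assumed formula the same fraction of each interval is added to every lower bound, so wider intervals contribute more of the remaining probability mass.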

Table 1: Performance comparison between the classical and credal wrapper version of BNN and DE, as well as EDD models. All models are implemented on VGG16/ResNet-18 backbones and trained using CIFAR10 data as ID samples. The results are from 15 runs. The best scores per metric are in bold. The results on corrupted data are averaged over all corruption types and intensities.

below the baselines) and poor EU estimation (evidenced by the lowest OOD detection values), as shown in Table 15 in the Appendix.

Figure 2: OOD detection using EU as the metric on CIFAR10 vs CIFAR10-C of the classical and credal wrapper version of BNNs and DE, and EDD against increased corruption intensity, using VGG16 and ResNet-18 as backbones.

Figure 3: ECE values of BNNR, BNNF, and DE on CIFAR10-C against increased corruption intensity, using the averaged probability (Avg. Prob.) and our proposed intersection probability (Int. Prob.). VGG16 and ResNet-18 are backbones. Results are from 15 runs.
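For readers unfamiliar with the ECE metric used in these figures, the following is a minimal sketch of the common equal-width-bin estimator. The binning scheme and the default `n_bins=10` are assumptions, since this excerpt does not state how ECE was computed in the paper, and the toy inputs are illustrative only.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean gap between accuracy and confidence.

    n_bins is a hypothetical default, not a value from the paper.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            # Weight each bin by the fraction of samples it holds.
            ece += in_bin.mean() * gap
    return float(ece)

# Calibrated toy case: confidence 0.8 and 4 of 5 predictions correct.
calibrated = expected_calibration_error([[0.8, 0.2]] * 5, [0, 0, 0, 0, 1])
# Overconfident toy case: confidence 0.9 but only half correct.
overconfident = expected_calibration_error([[0.9, 0.1]] * 2, [0, 1])
print(calibrated, overconfident)
```

A lower value means that the reported confidence tracks the empirical accuracy more closely.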

Table 2: OOD detection AUROC and AUPRC performance (%) of both the classical and credal wrapper version of DEs using EU as the metric. The results are from 15 runs, based on the ResNet-50 backbone. Best scores are in bold.

datasets and model architectures due to the high computational complexity (Mukhoti et al., 2023). For instance, training a ResNet-50-based BNN on CIFAR10 (resized to (224, 224, 3)) failed in our experiment due to exceeding the memory capacity of a single Nvidia A100 GPU. The dataset pairs (ID vs OOD) considered include CIFAR10/CIFAR100 (Krizhevsky, 2012) vs SVHN/Tiny-ImageNet, ImageNet (Deng et al., 2009) vs ImageNet-O (Hendrycks et al., 2021), CIFAR10 vs CIFAR10-C, and CIFAR100 vs CIFAR100-C (Hendrycks & Dietterich, 2019). DEs are implemented on the well-established ResNet-50 (He et al., 2016). All input data have a shape of (224, 224, 3). More training details are given in Appendix §B. The PIA algorithm (Algorithm 1) is applied using the settings J = 20 and J = 50 to calculate the generalized entropy ($\overline{H}(\mathbb{P})$ and $\underline{H}(\mathbb{P})$) on dataset pairs involving CIFAR100 and ImageNet, respectively. Compared to classical DEs, our credal wrapper demonstrates enhanced OOD detection across a spectrum of data pairs, as shown in Table 2, suggesting that our proposed method can consistently improve EU estimation.

Figure 4: OOD detection performance of the classical and credal wrapper version of DEs using EU as the metric on CIFAR10/100 vs CIFAR10-C/100-C against increased corruption intensity, using ResNet-50, EffB2, and ViT-B as backbones.

Table 4: OOD detection performance (%) comparison in DEs.

Figure 5: ECE values of DEs on CIFAR10-C and CIFAR100-C against increased corruption intensity, using the averaged probability (Avg. Prob.) and our proposed intersection probability (Int. Prob.). ResNet-50, EffB2, and ViT-B are backbones. Results are from 15 runs.

Then, we construct DEs using different numbers of ensemble members, namely N = 3, 5, 10, 15, 20, and 25. Each type of DE includes 15 instances using distinct seed combinations.

Figure 6: ECE values of DEs with various N on CIFAR10-C against increased corruption intensity.

of samples overall can lead to a lower ECE value; (ii) Compared to the naive averaging of DE predictions, our intersection probability consistently achieves lower ECE values on corrupted instances.

Ablation Study on Numbers of Predictive Samples in BNNs

In this experiment, we first increase the sampling size of BNNs at prediction time, namely N = 10 and N = 50. Table 6 reports the OOD detection performance of BNNR and BNNF involving CIFAR10 (ID) vs SVHN and Tiny-ImageNet (OODs), based on the VGG16 backbone. It shows that the credal wrapper consistently produces better EU estimates, as evidenced by enhanced OOD detection performance. Further, Table 12 in the Appendix reports the same comparison based on the ResNet-18 architecture, confirming those results.

Table 6: OOD detection comparisons using EU (%) of VGG16-based BNNs.

Table 7: OOD detection AUROC and AUPRC performance (%) comparison between classical and credal wrapper of BNNs and DEs using EU. The results are from 15 runs.

Experimental Validation

In this ablation study, we evaluate the EU estimation quality of our credal wrapper using the GH(P) measure. Table 7 reports OOD detection performance tested on CIFAR10 (ID) vs SVHN (OOD) and Tiny-ImageNet (OOD). All models are implemented on the VGG16 backbone. The results demonstrate that our credal wrapper consistently enhances EU estimation performance and is agnostic in the sense that it can accommodate any EU measure for credal sets.

Table 8: OOD detection AUROC and AUPRC performance (%) comparison between the classical and credal wrapper version of BNNs and DEs, using TU as the uncertainty metric. All models are implemented on VGG16/ResNet-18 backbones and tested on CIFAR10 (ID) vs SVHN (OOD) and Tiny-ImageNet (OOD). The results are from 15 runs. The best scores are in bold.

Table 9: OOD detection AUROC and AUPRC performance (%) of both the classical and credal wrapper version of DEs using TU as the metric. The results are from 15 runs, based on the ResNet-50 backbone. The best scores are in bold.

Table 10: OOD detection AUROC and AUPRC performance (%) of both the classical and credal wrapper version of DEs using TU as the metric. Results are from 15 runs, based on EffB2 and ViT-B backbones. The best scores are in bold. \.3. ABLATION STUDY ON OVERCONFIDENCE REGIME

Figure 7: OOD detection using TU as the metric on CIFAR10 vs CIFAR10-C of both the classical and credal wrapper version of BNNs and DE against increased corruption intensity, using VGG16 and ResNet-18 as backbones.

Figure 8: OOD detection using TU as the metric on CIFAR10/100 vs CIFAR10-C/100-C of both the classical and credal wrapper version of DEs against increased corruption intensity, using ResNet-50, EffB2, and ViT-B as backbones.

Table 11: Ablation study on numbers of predictive samples in DEs: OOD detection AUROC and AUPRC performance (%) of both the classical and credal wrapper version of DEs using TU as the uncertainty metric, involving CIFAR10 (ID) vs SVHN (OOD) and Tiny-ImageNet (OOD). The results are from 15 runs. The best scores are in bold.

Table 12: Ablation study on numbers of predictive samples in BNNs: OOD detection AUROC and AUPRC performance (%) of both the classical and credal wrapper version of BNNs with increased number of samples, involving CIFAR10 (ID) vs SVHN (OOD) and Tiny-ImageNet (OOD). The results are from 15 runs and the best scores per uncertainty metric are in bold.

in the case of 3 classes, if there are three distinct extreme probability vectors (three vertices of the simplex), our credal wrapper method will effectively convey complete uncertainty, with the resulting credal set encompassing the entire simplex. This conservative nature can be sensible, as it expresses our full ignorance of the correct classification.
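The three-class extreme case above can be checked numerically. The sketch below is illustrative only: it takes the three vertices of the simplex as the individual predictions, extracts the per-class bounds, and verifies on randomly drawn distributions (a hypothetical sanity check, not part of the paper) that every point of the simplex satisfies the resulting intervals.

```python
import numpy as np

# Three one-hot predictions, i.e. the three vertices of the simplex.
vertex_predictions = np.eye(3)

# Per-class bounds extracted from the set of predictions.
lower = vertex_predictions.min(axis=0)
upper = vertex_predictions.max(axis=0)

# Draw random distributions from the simplex and test whether each
# one satisfies all interval constraints.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=1000)
inside = np.all((samples >= lower) & (samples <= upper), axis=1)

print(lower, upper, inside.all())
```

Every interval becomes [0, 1], so the constraints exclude no distribution and the credal set is the whole simplex.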

Table 13: OOD detection using EU (left) and TU (right) as uncertainty metrics in overconfident scenarios. The results are from 15 runs, based on the ResNet-50 backbone. The best scores per uncertainty metric are in bold.

Figure 9: EU and TU estimates of ID (CIFAR10) and OOD (SVHN and Tiny-ImageNet) samples of the classical and credal wrapper version of DEs, obtained using ResNet-50, EffB2, and ViT backbones. Results are from 15 runs.

Figure 10: EU and TU estimates of ID (CIFAR100) and OOD (SVHN and Tiny-ImageNet) samples of the classical and credal wrapper version of DEs, obtained using ResNet-50, EffB2, and ViT backbones. Results are from 15 runs.

Figure 12: NLL values of DEs on CIFAR10-C and CIFAR100-C against increased corruption intensity, using the averaged probability (Avg. Prob.) and our proposed intersection probability (Int. Prob.). ResNet-50, EffB2, and ViT-B are backbones. Results are from 15 runs.

Table 14: OOD detection AUROC and AUPRC performance (%) of the credal wrapper of DEs using EU (left) and TU (right) as uncertainty metrics, and the time cost, using different settings of J in the PIA algorithm. The OOD detection involves CIFAR100 (ID) vs SVHN (OOD) and Tiny-ImageNet (OOD). The results are from 15 runs, based on the ResNet-50 backbone.

Table 15: Poor ID prediction and OOD detection performance of EDD-Fair. CIFAR10 as ID data.

OOD Detection Process

In this paper, the OOD detection process is treated as a binary classification. We label ID and OOD samples as 0 and 1, respectively. The model's uncertainty estimate (using the EU or TU) for each sample is the 'prediction' for the detection. In terms of performance indicators, the AUROC summarizes the trade-off between the true-positive and false-positive rates across all decision thresholds. The AUPRC evaluates the trade-off between precision and recall, providing insight into the model's effectiveness across different confidence levels.
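The detection protocol above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: AUROC is computed as the fraction of (OOD, ID) pairs in which the OOD sample receives the higher uncertainty, with ties counted as one half, and AUPRC is approximated by average precision, assuming no tied scores. The function names and the uncertainty values are hypothetical.

```python
import numpy as np

def auroc(scores, is_ood):
    """Probability that a random OOD sample has a higher uncertainty
    score than a random ID sample (ties count as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    pos, neg = scores[is_ood], scores[~is_ood]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def average_precision(scores, is_ood):
    """Mean precision at the rank of each OOD sample, ranking by
    decreasing uncertainty (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    hits = is_ood[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# Labels follow the text: ID = 0, OOD = 1. Scores are made-up
# uncertainty estimates (e.g. EU or TU), one per sample.
labels = [0, 0, 0, 1, 1, 1]
uncertainty = [0.05, 0.10, 0.40, 0.30, 0.80, 0.90]

roc = auroc(uncertainty, labels)
ap = average_precision(uncertainty, labels)
print(roc, ap)
```

Both scores equal 1.0 only when every OOD sample receives a higher uncertainty than every ID sample. In the toy data one ID sample (0.40) outranks one OOD sample (0.30), which lowers both values.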

Figure 13: Different credal set generation methods. Left: our credal wrapper; right: convex hull.

The theoretical underpinning for the convex hull method when reasoning with coherent lower probabilities (and, therefore, the corresponding credal sets) is that it allows us to comply with the coherence principle (Walley, 1991). In a Bayesian context, individual predictions (such as those of networks with specified weights) can be interpreted as subjective pieces of evidence about a fact (e.g., what is the true class of an input observation). Coherence ensures that one realizes the full implications of such partial assessments (Walley, 1991; Cuzzolin, 2008).

Figure 13 conceptually shows the differences between the two methods in a 2D simplex. Compared to the convex hull method, our probability interval systems exhibit a more conservative nature. Another practical difference is that the convex hull method is highly computationally complex, preventing it from being practically implemented in multi-class classification tasks. In the following, we aim to explain the associated complexity of the calculation process.
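The more conservative nature of the interval-based credal set can be illustrated on a small case. The sketch below is restricted to three classes and exactly three affinely independent predictions, where convex-hull membership reduces to solving for barycentric weights. The general multi-class case needs heavier machinery, which is the complexity issue mentioned above. All numeric values are hypothetical.

```python
import numpy as np

# Three hypothetical predictions over three classes.
predictions = np.array([
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.7, 0.2, 0.1],
])
lower, upper = predictions.min(axis=0), predictions.max(axis=0)

def in_interval_set(q):
    """True if q satisfies every per-class probability interval."""
    return bool(np.all((q >= lower - 1e-12) & (q <= upper + 1e-12)))

def in_convex_hull(q):
    """True if q is a convex combination of the three predictions.

    Solves predictions.T @ w = q; q lies in the hull only if all
    weights are non-negative (they sum to one automatically here).
    """
    w = np.linalg.solve(predictions.T, q)
    return bool(np.all(w >= -1e-9))

# A distribution that respects all intervals but is not a mixture
# of the three predictions.
q = np.array([0.5, 0.2, 0.3])
# A plain average of the predictions, which lies in both sets.
mean_prediction = predictions.mean(axis=0)

print(in_interval_set(q), in_convex_hull(q))
print(in_interval_set(mean_prediction), in_convex_hull(mean_prediction))
```

The point `q` belongs to the interval-induced credal set but not to the convex hull, while any mixture of the predictions belongs to both, so in this example the interval set strictly contains the hull.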

