AlphaNet: Improving Long-Tail Classification By Combining Classifiers

  • † Carnegie Mellon University
  • ‡ University of Illinois Urbana-Champaign

Equal contribution

Abstract

Methods in long-tail learning focus on improving performance for data-poor (rare) classes; however, performance for such classes remains much lower than performance for more data-rich (frequent) classes. Analyzing the predictions of long-tail methods for rare classes reveals that a large number of errors are due to misclassification of rare items as visually similar frequent classes. To address this problem, we introduce AlphaNet, a method that can be applied to existing models, performing post hoc correction on classifiers of rare classes. Starting with a pre-trained model, we find frequent classes that are closest to rare classes in the model’s representation space and learn weights to update rare class classifiers with a linear combination of frequent class classifiers. AlphaNet, applied to several models, greatly improves test accuracy for rare classes in multiple long-tailed datasets, with very little change to overall accuracy. Our method also provides a way to control the trade-off between rare class and overall accuracy, making it practical for long-tail classification in the wild.

1 Introduction

The significance of long-tailed distributions in real-world applications (such as autonomous driving1 and medical image analysis2) has spurred a variety of approaches for long-tail classification3. Learning in this setting is challenging because many classes are “rare” – having only a small number of training samples. Some methods re-sample more data for rare classes in an effort to address data imbalances4,5, while other methods adjust learned classifiers to re-weight them in favor of rare classes6. Both re-sampling and re-weighting methods provide strong baselines for long-tail classification tasks. However, state-of-the-art results are achieved by more complex methods that, for example, learn multiple experts7,8, perform multi-stage distillation9, or use a combination of weight decay, loss balancing, and norm thresholding10.

Despite these advances, accuracy on rare classes continues to be significantly lower than overall accuracy. For example, on ImageNet‑LT – a long-tailed dataset sampled from ImageNet11 – the 6-expert ensemble RIDE model7 has an average accuracy of 68.9% on frequent classes, but an average accuracy of 36.5% on rare classes.n1 In addition to reducing overall accuracy, such performance imbalances raise ethical concerns in contexts where unequal accuracy leads to biased outcomes, such as medical imaging12 or face detection13. For instance, models trained on chest X‑ray images consistently under-diagnosed minority groups14, and similarly, cardiac image segmentation showed significant differences between racial groups15.

(a)
(b)
(c)

Figure 1: Analysis of ‘few’ split predictions on ImageNet‑LT. (a) Predictions from the cRT model on test samples from the ‘few’ split of ImageNet‑LT. For a misclassified sample, if the predicted class is one of the 5 ‘base’ split nearest neighbors (NNs) of the true class, it is considered to be incorrectly classified as a NN. A large number of samples are misclassified in this way. (b) Sample images from two classes in ImageNet‑LT. ‘Lhasa’ is a ‘few’ split class, and ‘Tibetan terrier’ is a ‘base’ split class. The classes are visually very similar, leading to misclassifications. (c) Per-class test accuracy of the cRT model on the ‘few’ split of ImageNet‑LT, versus the mean Euclidean distance to the 5 nearest neighbor (NN) ‘base’ split classes. The line is a bootstrapped linear regression fit, and $r$ (top right) is the Pearson correlation coefficient. There is a high correlation, i.e., ‘few’ split classes with close ‘base’ split NNs are more likely to be misclassified.

To understand the poor rare class performance of long-tail models, we analyzed predictions of the cRT model6 on test samples from ImageNet‑LT’s ‘few’ split (i.e., classes with limited training samples). Figure 1 (a) shows predictions binned into three groups: (1) samples classified correctly; (2) samples incorrectly classified as a visually similar ‘base’ splitn2 class (e.g., ‘husky’ instead of ‘malamute’); and (3) samples incorrectly classified as a visually dissimilar class (e.g., ‘goldfish’ instead of ‘malamute’). A significant portion of the misclassifications (about 23%) are to visually similar frequent classes. Figure 1 (b) highlights the reason behind this issue, with samples from one pair of visually similar classes; the differences are subtle, and can be hard even for humans to identify. To get a quantitative understanding, we analyzed the relationship between per-class test accuracy and mean distance of a class to its nearest neighbors (see Section 3 for details). Figure 1 (c) shows a strong positive correlation between accuracy and mean distance, meaning that rare classes with close neighbors have lower test accuracy than classes with distant neighbors.
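
The binning used in Figure 1 (a) is straightforward to reproduce once per-class nearest neighbors have been computed. The sketch below is a minimal version under our own naming conventions; the arrays, the `few_classes` set, and the `base_nns` mapping are assumptions for illustration, not part of any released code:

```python
import numpy as np

def bin_few_split_predictions(y_true, y_pred, few_classes, base_nns):
    """Bin 'few' split test predictions into three groups: correct,
    misclassified as a 'base' split nearest neighbor (NN), and
    misclassified as any other class.

    y_true, y_pred: integer class labels for all test samples.
    few_classes:    set of 'few' split class indices.
    base_nns:       dict mapping each 'few' class to its k 'base' split NNs.
    """
    counts = {"correct": 0, "nn_error": 0, "other_error": 0}
    for t, p in zip(y_true, y_pred):
        if t not in few_classes:
            continue  # only analyze 'few' split samples
        if p == t:
            counts["correct"] += 1
        elif p in base_nns[t]:
            counts["nn_error"] += 1
        else:
            counts["other_error"] += 1
    return counts

# Toy example: class 7 is a 'few' class whose 'base' split NNs are {2, 3}.
y_true = np.array([7, 7, 7, 7])
y_pred = np.array([7, 2, 3, 9])
print(bin_few_split_predictions(y_true, y_pred, {7}, {7: {2, 3}}))
# {'correct': 1, 'nn_error': 2, 'other_error': 1}
```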

Based on these analyses, we designed a method to directly improve the accuracy on rare classes in long-tail classification. Our method, AlphaNet, uses information from visually similar frequent classes to improve classifiers for rare classes. Figure 2 illustrates the pipeline of our method. At a high level, AlphaNet can be seen as moving the classifiers for rare classes based on their position relative to visually similar classes. Importantly, AlphaNet updates classifiers without making any changes to the representation space, or to other classifiers in the model. It performs a post hoc correction, and as such, is applicable to use cases where existing base classifiers are either unavailable or fixed (e.g., due to commercial interests or data privacy protections). The simplicity of our method lends itself to computational advantages – AlphaNet can be trained rapidly, and on top of any classification model. We will demonstrate that AlphaNet, applied to a variety of long-tail classification models, significantly improves rare class accuracy on multiple datasets.

2 Related work

Our work falls in the domain of long-tail learning, where the distribution of class sizes – measured via number of training samples – models that of the visual world; many classes have only a few samples, while a small number have many16. Kang et al.6 established strong baselines on long-tailed datasets by decoupling classifiers and representations. We apply AlphaNet to two of their proposed baseline methods: (1) the cRT (classifier re-training) model, which fixes representations, and trains classifiers from scratch; and (2) the LWS (learnable weight scaling) model, which also fixes representations, and only rescales classifiers, with scales learned from the training data.

In contrast to the above simple methods, many complex methods have been proposed, and have continued to push the state-of-the-art for long-tail recognition17,7,9,8,10. We used two of these methods to evaluate AlphaNet: (1) the RIDE (RoutIng Diverse Experts) model of Wang et al.7, which achieves low bias and low variance by training with a “distribution-aware diversity loss” and by using multiple experts, respectively; and (2) the weight-balancing model of Alshammari et al.10, which uses a combination of class-balanced loss, weight decay, and max-norm regularization; in this work we refer to it as the LTR (long-tail recognition) model.

In the rest of this section, we discuss some works that make use of similar ideas as AlphaNet.

2.1 Knowledge transfer

AlphaNet bears resemblance to methods that create new classifiers by transferring knowledge from existing classifiers. These methods appear in a number of domains, such as transfer learning, meta-learning, and multi-task learning18,19,20,21. It should be noted, however, that AlphaNet does not create new classifiers – it only modifies existing classifiers by combining them with others.

Pertinent to our problem setting is the work by Wang et al.22, who showed that a “generic, category agnostic transformation” can be learned from models trained on few samples, to models trained on many samples. In our work, we implicitly learn a similar transformation, but with the source and target classifiers within the same model. Additionally, the transformation is constrained to be a linear combination. A similar paradigm was analyzed by Du et al.23, who showed that for cases where the target function is generated by a simple transformation of the source function, there are theoretical performance guarantees for a large class of functions.

2.2 Classifier composition

In low-shotn3 and zero-shot classification, new classifiers are learned using few or zero training examples24,25. Some methods have done this by directly combining existing classifiers. For example, Mensink et al.26 learned new classifiers as linear combinations of existing classifiers, with weights determined by co-occurrence statistics. Changpinyo et al.27 introduced “phantom classes” and used their classifiers as bases to compose new classifiers through convex combination.

The idea of combining existing classifiers was also used by Aytar et al.28, in their work on enhancing single exemplar support vector machines (SVMs). In their method, an extra regularization term is added to the SVM loss function, which encourages the learned classifier to be close to a linear combination of previously learned classifiers. Classifiers trained on image patches are used to transfer knowledge to a classifier trained on a single positive exemplar.

2.2.1 Boosting

Composing weak classifiers to build strong classifiers also bears resemblance to the idea of boosting29. The popular AdaBoost30 algorithm linearly combines classifiers based on single features (e.g., decision stumps), and iteratively re-weights training samples based on their error. For the case of multi-class classification, Torralba et al.31 built a classifier that combines several binary classifiers, each designed to separate a single class from the others. Their method identifies common features that can be shared across classifiers, which reduces the computational load and the amount of training data required.

It is important to note that boosting methods employ a different form of composition than our method. Specifically, our focus is on classification models where performance on a subset of classes is poor. Unlike boosting methods, we do not incorporate additional features – improvements are made by adjusting classifiers within the learned representation space.

3 Method

Our problem setting is multi-class classification with $C$ classes, where each input has a corresponding class label in $\{0, \dots, C - 1\}$, and the goal is to learn a mapping from inputs to labels. We are specifically interested in visual recognition, so inputs are images and classes are object categories. AlphaNet is applied to a pre-trained classification model; we assume that this model can be decoupled into two parts: the first part maps images to feature vectors, and the second part maps feature vectors to “scores”, one for each of the $C$ classes. The prediction for an image is the class index with the largest corresponding score. Typically (and in all our experiments), for a convolutional network, the feature vector for an image is the output of the penultimate layer, and the last layer is a linear mapping. So, each classifier is a vector, and the score for a class is the dot product of the feature vector with the corresponding classifier. A bias term, if present, is added to the dot product; we do not modify this term, and use it as-is.

In this work, we define the distance between two classes as the distance between their average training set representation. Let $f$ be the function mapping images to feature vectors in a $d$-dimensional representation space. For a class $c$ with $n^c$ training samples $I^c_1, \dots, I^c_{n^c}$, let $\bm{z}^c \equiv (1/n^c) \sum_i f(I^c_i)$ be the average training set representation. Given a distance function $\mu: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we define the distance between two classes $c_1$ and $c_2$ as $m_\mu(c_1, c_2) \equiv \mu(\bm{z}^{c_1}, \bm{z}^{c_2})$.
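
As a concrete sketch of this computation (our own minimal version, assuming the training-set features have already been extracted into a `features` tensor with one row per image; names are ours):

```python
import torch

def class_means(features, labels, num_classes):
    """Average training-set representation z^c for each class.
    features: (N, d) tensor; labels: (N,) tensor of class indices."""
    d = features.shape[1]
    means = torch.zeros(num_classes, d)
    for c in range(num_classes):
        means[c] = features[labels == c].mean(dim=0)
    return means

def nearest_base_neighbors(means, few_classes, base_classes, k=5):
    """For each 'few' class, return its k nearest 'base' classes under
    Euclidean distance between class means (m_mu in the text)."""
    nns = {}
    base = torch.tensor(sorted(base_classes))
    for c in few_classes:
        dists = torch.norm(means[base] - means[c], dim=1)  # Euclidean distance
        nns[c] = base[dists.topk(k, largest=False).indices].tolist()
    return nns
```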

Given a long-tailed dataset, the ‘few’ split ($C^F$) is defined as the set of classes with fewer than $T$ training samples, for some constant $T$ (equal to 20 for the datasets used in this work). The remaining classes form the ‘base’ split ($C^B$). AlphaNet is used to update the ‘few’ split classifiers using nearest neighbors from the ‘base’ split.

3.1 AlphaNet implementation

Figure 2: Pipeline for AlphaNet. Given a rare class, we identify the nearest neighbor frequent classes based on visual similarity, and then update the rare class’ classifier using learned coefficients. One coefficient, $\alpha$, is learned for each nearest neighbor. The result is an improved classifier for the rare class.

Figure 2 shows the pipeline of our method. Given a ‘few’ split class $c$ with classifier $\bm{w}^c$, we find its $k$ nearest ‘base’ split neighbors based on $m_\mu$. Let these neighbors have classifiers $\bm{v}^c_1, \dots, \bm{v}^c_k$, which are concatenated together into a vector $\overline{\bm{v}}^c$. AlphaNet maps $(\bm{w}^c, \overline{\bm{v}}^c)$ to a set of coefficients $\alpha^c_1, \dots, \alpha^c_k$. The $\alpha$ coefficients (denoted together as a vector $\bm{\alpha}^c$) are then scaled to unit 1-norm to obtain $\tilde{\bm{\alpha}}^c$ (the justification for this will be presented shortly):
$$ \tilde{\bm{\alpha}}^c \equiv \bm{\alpha}^c / \left\|\bm{\alpha}^c\right\|_1. \tag{1} $$
The scaled coefficients are used to update the ‘few’ split classifier ($\bm{w}^c \to \hat{\bm{w}}^c$) through a linear combination:
$$ \hat{\bm{w}}^c \equiv \bm{w}^c + \sum_{i=1}^k \tilde{\alpha}^c_i \bm{v}^c_i. \tag{2} $$
Due to the 1-norm scaling, we have
$$
\begin{aligned}
\left\|\hat{\bm{w}}^c - \bm{w}^c\right\|_2
&\le \sum_{i=1}^k \left|\tilde{\alpha}^c_i\right| \left\|\bm{v}^c_i\right\|_2 \quad \text{(triangle inequality)} \\
&\le \max_{i=1,\dots,k} \left\|\bm{v}^c_i\right\|_2 \sum_{i=1}^k \left|\tilde{\alpha}^c_i\right| \\
&= \max_{i=1,\dots,k} \left\|\bm{v}^c_i\right\|_2 \left\|\tilde{\bm{\alpha}}^c\right\|_1
 = \max_{i=1,\dots,k} \left\|\bm{v}^c_i\right\|_2,
\end{aligned}
\tag{3}
$$
that is, the change to a classifier is bounded by the largest norm among its nearest neighbors’ classifiers. Thanks to this, we do not need to rescale ‘base’ split classifiers, which may not be possible in certain domains.
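
The update in eqs. (1) and (2) can be written as a small PyTorch module. The sketch below is our own simplification: the three 32-unit hidden layers follow the description in Section 4.1, but the exact input representation and other details are assumptions and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class AlphaNet(nn.Module):
    """Predict k combination coefficients per 'few' class and update its classifier."""

    def __init__(self, clf_dim, k, hidden=32):
        super().__init__()
        # Input: a 'few' classifier concatenated with its k neighbor classifiers.
        self.alpha_net = nn.Sequential(
            nn.Linear((k + 1) * clf_dim, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, k),
        )

    def forward(self, w_few, v_nns):
        """w_few: (F, d) 'few' split classifiers.
        v_nns:   (F, k, d) classifiers of the k nearest 'base' neighbors."""
        num_few, k, d = v_nns.shape
        inp = torch.cat([w_few, v_nns.reshape(num_few, k * d)], dim=1)
        alpha = self.alpha_net(inp)                               # (F, k)
        alpha = alpha / alpha.abs().sum(dim=1, keepdim=True)      # eq. (1): unit 1-norm
        w_hat = w_few + (alpha.unsqueeze(-1) * v_nns).sum(dim=1)  # eq. (2)
        return w_hat
```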

Note that a single network is used to generate coefficients for every ‘few’ split class. So, once trained, AlphaNet can be applied even to classes not seen during training. This will be explored in future work.

3.2 Training

The trainable component of AlphaNet is a network with parameters $\bm{\theta}$, which maps $(\bm{w}^c, \overline{\bm{v}}^c)$ to $\bm{\alpha}^c$. We use the original classifier biases, $\bm{b}$ (one per class). So, given a training image $I$, the per-class prediction scores are given by
$$ s(c; I) = \begin{cases} f(I)^T \hat{\bm{w}}^c + b_c & c \in C^F, \\ f(I)^T \bm{w}^c + b_c & c \in C^B. \end{cases} \tag{4} $$
That is, class scores are unchanged for ‘base’ split classes, and are computed using updated classifiers for ‘few’ split classes. These scores are used to compute the sample loss (softmax cross-entropy in our experiments), a differentiable function of $\bm{\theta}$. So, $\bm{\theta}$ can be learned using a gradient based optimizer, from mini-batches of training samples.
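
A sketch of a single training step implementing eq. (4), using the `AlphaNet` module sketched above: it assumes pre-extracted features, frozen base-model classifiers and biases, and that ‘few’ split classes are indexed before ‘base’ split classes so the concatenated scores line up with the labels (all of these are our assumptions, not a description of the released code).

```python
import torch
import torch.nn.functional as F

def train_step(alphanet, optimizer, feats, labels,
               w_few, v_nns, b_few, w_base, b_base):
    """One AlphaNet update; only alphanet's parameters (theta) are trained.
    feats:  (N, d) pre-extracted features f(I) for a mini-batch.
    labels: (N,) class indices ('few' classes indexed before 'base' classes).
    """
    w_hat = alphanet(w_few, v_nns)                        # updated 'few' classifiers, eq. (2)
    scores_few = feats @ w_hat.t() + b_few                # eq. (4), c in C^F
    scores_base = feats @ w_base.t() + b_base             # eq. (4), c in C^B (unchanged)
    scores = torch.cat([scores_few, scores_base], dim=1)  # (N, C)
    loss = F.cross_entropy(scores, labels)                # softmax cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```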

4 Experiments

4.1 Experimental setup

A detailed description of the experimental methods is contained in Appendix A. A short summary is presented here.

Datasets. We evaluated AlphaNet using three long-tailed datasets:n4 ImageNet‑LT and Places‑LT, curated by Liu et al.17, and CIFAR‑100‑LT, created using the procedure described by Cui et al.4. These datasets are sampled from their respective original datasets – ImageNet33, Places36534, and CIFAR‑10035 – such that the number of per-class training samples has a long-tailed distribution.

The datasets are broken down into three broad splits based on the number of training samples per class: (1) ‘many’ contains classes with greater than 100 samples; (2) ‘medium’ contains classes with greater than or equal to 20 samples but less than or equal to 100 samples; and (3) ‘few’ contains classes with fewer than 20 samples. The test set is always balanced, containing an equal number of samples for each class. We refer to the combined ‘many’ and ‘medium’ splits as the ‘base’ split.

Training data sampling. In order to prevent over-fitting on the ‘few’ split samples, we used a class-balanced sampling approach, using all ‘few’ split samples, and a portion of the ‘base’ split samples. Given $F$ ‘few’ split samples and a ratio $\rho$, $\rho F$ samples were drawn from the ‘base’ split every epoch, with sample weights inversely proportional to the size of their class. This ensured that all ‘base’ classes had an equal probability of being sampled.n5 As we show in the following section, $\rho$ allows us to control the balance between ‘few’ and ‘base’ split accuracy.
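
A minimal sketch of this sampler (our own version; the function name and the NumPy-based implementation are assumptions, and the actual training code may differ):

```python
import numpy as np

def epoch_indices(labels, few_classes, rho, rng=None):
    """Indices for one training epoch: all 'few' split samples, plus
    rho * F 'base' split samples drawn with probability inversely
    proportional to their class size, so every 'base' class is equally
    likely to be represented."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    few_mask = np.isin(labels, list(few_classes))
    few_idx = np.flatnonzero(few_mask)
    base_idx = np.flatnonzero(~few_mask)

    class_size = np.bincount(labels)
    weights = 1.0 / class_size[labels[base_idx]]  # inverse class frequency
    probs = weights / weights.sum()

    n_base = int(rho * len(few_idx))
    drawn = rng.choice(base_idx, size=n_base, replace=False, p=probs)
    return np.concatenate([few_idx, drawn])
```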

Training. All experiments used an AlphaNet with three 32-unit layers. Unless stated otherwise, Euclidean distance was used to find $k=5$ nearest neighbors for each ‘few’ split class. In this section, we show results for $\rho$ in $\{0.5, 1, 1.5\}$. Results for a larger set of $\rho$ values are shown in Appendix C. All experiments were repeated 10 times, and we report average results.

4.2 Long-tail classification results

Table 1: Mean split accuracy in percents (standard deviation in parentheses) of AlphaNet and various baseline methods on ImageNet‑LT and Places‑LT. α‑cRT and α‑LWS are AlphaNet models applied over cRT and LWS features respectively.

| Method | Few | Med. | Many | Overall |
| --- | --- | --- | --- | --- |
| ImageNet‑LT | | | | |
| NCM | 28.1 | 45.3 | 56.6 | 47.3 |
| τ‑normalized | 30.7 | 46.9 | 59.1 | 49.4 |
| cRT | 27.4 | 46.2 | 61.8 | 49.6 |
| α‑cRT, ρ=0.5 | 39.7 (1.42) | 42.0 (0.66) | 58.3 (0.52) | 48.0 (0.37) |
| α‑cRT, ρ=1 | 34.6 (1.88) | 43.7 (0.51) | 59.7 (0.43) | 48.6 (0.24) |
| α‑cRT, ρ=1.5 | 32.6 (2.46) | 44.4 (0.49) | 60.3 (0.38) | 48.9 (0.19) |
| LWS | 30.4 | 47.2 | 60.2 | 49.9 |
| α‑LWS, ρ=0.5 | 46.9 (0.98) | 38.6 (0.87) | 52.9 (0.86) | 45.3 (0.69) |
| α‑LWS, ρ=1 | 41.6 (1.61) | 42.2 (0.53) | 56.0 (0.32) | 47.4 (0.30) |
| α‑LWS, ρ=1.5 | 40.1 (1.99) | 43.2 (0.98) | 56.9 (0.76) | 48.0 (0.53) |
| Places‑LT | | | | |
| NCM | 27.3 | 37.1 | 40.4 | 36.4 |
| τ‑normalized | 30.7 | 46.9 | 59.1 | 49.4 |
| cRT | 24.9 | 37.6 | 42.0 | 36.7 |
| α‑cRT, ρ=0.5 | 31.0 (0.88) | 34.5 (0.17) | 40.4 (0.29) | 35.9 (0.09) |
| α‑cRT, ρ=1 | 27.0 (1.02) | 36.1 (0.31) | 41.3 (0.13) | 36.2 (0.10) |
| α‑cRT, ρ=1.5 | 25.5 (0.89) | 36.5 (0.36) | 41.6 (0.21) | 36.2 (0.11) |
| LWS | 28.7 | 39.1 | 40.6 | 37.6 |
| α‑LWS, ρ=0.5 | 37.1 (1.39) | 34.4 (0.80) | 37.7 (0.52) | 36.1 (0.31) |
| α‑LWS, ρ=1 | 34.6 (0.97) | 35.8 (0.54) | 38.6 (0.39) | 36.6 (0.22) |
| α‑LWS, ρ=1.5 | 32.2 (1.17) | 37.2 (0.36) | 39.5 (0.39) | 37.0 (0.11) |

Baseline models. First, we applied AlphaNet to models fine-tuned using classifier re-training (cRT) and learnable weight scaling (LWS)6. These models have good overall accuracy, but accuracy for ‘few’ split classes is much lower. On ImageNet‑LT, average ‘few’ split accuracy using a ResNeXt‑50 backbone is around 20 points below the overall accuracy for both cRT and LWS, as seen in Table 1. The table also shows two other baseline methods: nearest class mean (NCM), which classifies a sample to the class with the nearest average training representation, and $\tau$‑normalized, which scales classifier weights by their $\tau$-norm6.
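
For reference, the $\tau$-normalized baseline rescales each classifier by its norm raised to the power $\tau$; a minimal sketch of the idea described by Kang et al. (not their code) is:

```python
import torch

def tau_normalize(W, tau=1.0, eps=1e-12):
    """Rescale each row of W (one classifier per class) by 1 / ||w_c||^tau."""
    norms = W.norm(dim=1, keepdim=True).clamp_min(eps)
    return W / norms.pow(tau)
```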

Using features extracted from the cRT and LWS models, we used AlphaNet to update ‘few’ split classifiers, creating α‑cRT and α‑LWS respectively. Per-split accuracies, obtained by training with $\rho$ in $\{0.5, 1, 1.5\}$, are shown in Table 1. We get a significant increase in ‘few’ split accuracy for all values of $\rho$. Moreover, we see that $\rho$ allows us to control the balance between ‘few’ split and overall accuracy. Using larger values of $\rho$ – i.e., training with more ‘base’ split samples – allows overall accuracy to remain closer to the original, while still affording significant gains to ‘few’ split accuracy. With $\rho=1.5$, α‑cRT boosts ‘few’ split accuracy by more than 5 points, while overall accuracy stays within about 1 point of the baseline. α‑LWS achieves even larger gains, increasing ‘few’ split accuracy to around 40%, while still maintaining a competitive 48% overall accuracy.

We repeated the above experiment on Places‑LT, where we see similar performance gains on the ‘few’ split (Table 1). Notably, with $\rho=1$, α‑LWS increases ‘few’ split accuracy by about 6 points, while overall accuracy is within 1 point of the LWS model.

Table 2: Mean split accuracy in percents (standard deviation in parentheses) on ImageNet‑LT and CIFAR‑100‑LT using the ensemble RIDE model. α‑RIDE applies AlphaNet on average features from the ensemble.

| Method | Few | Med. | Many | Overall |
| --- | --- | --- | --- | --- |
| ImageNet‑LT | | | | |
| RIDE | 36.5 | 54.4 | 68.9 | 57.5 |
| α‑RIDE, ρ=0.5 | 43.5 (0.75) | 52.3 (0.26) | 67.3 (0.17) | 56.9 (0.11) |
| α‑RIDE, ρ=1 | 40.8 (1.00) | 53.1 (0.21) | 67.9 (0.18) | 57.1 (0.11) |
| α‑RIDE, ρ=1.5 | 38.2 (1.22) | 53.6 (0.25) | 68.4 (0.17) | 57.2 (0.06) |
| CIFAR‑100‑LT | | | | |
| RIDE | 25.8 | 52.1 | 69.3 | 50.2 |
| α‑RIDE, ρ=0.5 | 32.3 (1.24) | 45.9 (0.87) | 64.6 (0.78) | 48.4 (0.43) |
| α‑RIDE, ρ=1 | 27.6 (1.41) | 49.5 (0.83) | 67.4 (0.70) | 49.2 (0.16) |
| α‑RIDE, ρ=1.5 | 25.2 (1.11) | 50.2 (0.57) | 68.3 (0.34) | 49.0 (0.26) |
Table 3: Mean split accuracy in percents (standard deviation in parentheses) on CIFAR‑100‑LT using the LTR model.

| Method | Few | Med. | Many | Overall |
| --- | --- | --- | --- | --- |
| CIFAR‑100‑LT | | | | |
| LTR | 29.8 | 49.3 | 70.1 | 50.7 |
| α‑LTR, ρ=0.5 | 36.2 (1.39) | 39.5 (2.11) | 63.4 (4.18) | 46.9 (1.95) |
| α‑LTR, ρ=1 | 32.2 (1.77) | 43.4 (1.22) | 67.5 (0.95) | 48.5 (0.73) |
| α‑LTR, ρ=1.5 | 32.0 (1.49) | 46.2 (0.86) | 67.3 (1.02) | 49.3 (0.33) |

State-of-the-art models. Next, we applied AlphaNet to two state-of-the-art models: (1) the 6-expert ensemble RIDE model7, and (2) the weight balancing LTR model10. See Appendix A.2.1 for details on feature extraction for these models. Table 2 shows the base results for RIDE, along with AlphaNet results for $\rho \in \{0.5, 1, 1.5\}$. On ImageNet‑LT, ‘few’ split accuracy was increased by up to 7 points, and on CIFAR‑100‑LT, by 5 points. For the LTR model, we show results on CIFAR‑100‑LT in Table 3 – we are able to increase ‘few’ split accuracy by almost 7 points.

These results show that AlphaNet can be applied reliably with state-of-the-art models to significantly improve the accuracy for rare classes.

4.3 Comparison with control

Our method is based on the core hypothesis that classifiers can be improved using nearest neighbors. In this section, we directly evaluate this hypothesis. Based on the results in the previous section, the improvements in ‘few’ split accuracy could be attributed simply to the extra fine-tuning of the classifiers. So, using the cRT model on ImageNet‑LT, we retrained AlphaNet with 5 randomly chosen ‘base’ split classes as “neighbors” for each ‘few’ split class. This differs from our previous experiments only in the classes used to update ‘few’ split classifiers, so if AlphaNet’s improvements were solely due to extra fine-tuning, we should see similar results. However, as seen in Figure 4, training with nearest neighbors selected by Euclidean distance garners much larger improvements in ‘few’ split accuracy, with similar trends in overall accuracy. This supports our hypothesis that classifiers for data-poor classes can make use of information from visually similar classes to improve classification performance.

4.4 Prediction changes

As shown in Section 1, the cRT model frequently misclassifies ‘few’ split samples as visually similar ‘base’ split classes. Using the AlphaNet model with $\rho=0.5$, we performed the same analyses as before. Figure 5 (a) shows the change in sample predictions, where we see that a large portion of samples previously misclassified as a nearest neighbor are correctly classified after their classifiers are updated with AlphaNet. Furthermore, as seen in Figure 3, AlphaNet improvements are strongly correlated with mean nearest neighbor distance. Classes with close neighbors, which had a high likelihood of being misclassified by the baseline model, see the biggest improvement in test accuracy.

Figure 3: Change in per-class test accuracy for the ‘few’ split of ImageNet‑LT with α‑cRT, versus mean Euclidean distance to the 5 nearest neighbors. Comparing with Figure 1 (c), we see that AlphaNet provides the largest boost to classes with close nearest neighbors, which have poor baseline performance.
Figure 4: Change in split accuracies for α‑cRT on ImageNet‑LT. For each value of $\rho$, the two plots show the raw difference in split accuracy (with accuracy expressed as a fraction) for AlphaNet compared to the baseline cRT model. Left: results for normal training with 5 nearest neighbors by Euclidean distance; right: results for training with 5 random “neighbors” for each ‘few’ split class. Training with nearest neighbors leads to a larger increase in ‘few’ split accuracy, especially for small $\rho$, which cannot be accounted for by the additional fine-tuning of classifiers alone.
(a)
(b)
(c)

Figure 5: Change in sample predictions for α‑cRT ($\rho=0.5$) on ImageNet‑LT. For each plot, the bars on the left show the distribution of predictions by the baseline model; and the bars on the right show the distribution for α‑cRT. The groupings follow the scheme described in Figure 1 (a). The counts are aggregated from 10 repetitions of training α‑cRT. The “flow” bands from left to right show the changes in individual sample predictions. (a) Predictions on ‘few’ split classes, with NNs selected from the ‘base’ split. (b) Predictions on ‘base’ split classes, with NNs selected from the ‘few’ split. (c) All predictions, with NNs selected from all classes. The hatched portions represent the ‘few’ split.

(a)
(b)
(c)

Figure 6: Change in sample predictions for α‑cRT ($\rho=0.5$), grouped with respect to nearest neighbors identified using WordNet. This figure shows the same results as Figure 5, but grouped using differently defined nearest neighbors – the new nearest neighbors are used only for visualization. (a) Predictions on ‘few’ split classes. (b) Predictions on ‘base’ split classes. (c) All predictions; hatched portions represent the ‘few’ split.

4.5 Analysis of AlphaNet predictions

AlphaNet significantly boosts the accuracy of ‘few’ split classes. However, we do see a decrease in overall accuracy compared to baseline models, particularly for small values of $\rho$. It is important to note that the increase in ‘few’ split accuracy is much larger than the decrease in overall accuracy. As discussed earlier, in many applications it is important to have balanced performance across classes, and AlphaNet succeeds in making accuracies more balanced across splits.

To understand the decrease, we further analyzed the prediction changes for ‘base’ split samples. Specifically, Figure 5 (b) shows the change in predictions for ‘base’ split samples, with nearest neighbors selected from the ‘few’ split. We see a small increase in misclassifications as ‘few’ split classes. This accounts for the slight decrease in overall accuracy, which is also evident in Figure 5 (c), where all predictions are shown, with nearest neighbors from all classes.

The previous analysis was conducted using nearest neighbors identified based on visual similarity. Since this depends on the representation space of the particular model, we conducted an additional analysis to examine the behavior of predictions with respect to semantically similar categories. For classes in ImageNet‑LT, we defined nearest neighbors using distance in the WordNet36 hierarchy. Specifically, if two classes (e.g., ‘Lhasa’ and ‘Tibetan terrier’) share a parent at most 4 levels higher in WordNet (in this example, ‘dog’), we consider them to be one of each other’s nearest neighbors. Figure 6 shows α‑cRT predictions grouped using nearest neighbors defined this way. We see that a large number of incorrect predictions are to semantically similar categories, which can be hard even for humans to distinguish. This suggests that metrics for long-tail classification should be re-evaluated for large datasets with many similar classes. Considering only misclassifications to semantically dissimilar classes, we see that AlphaNet still improves performance on ‘few’ split classes, while maintaining overall accuracy. So, despite model-specific visual similarity, AlphaNet garners improvements at the semantic level, showing that it can be applied to models beyond those used in this paper.

5 Conclusion

The long-tailed nature of the world presents a challenge for classification models, due to the imbalance in the number of training samples per class. A number of methods have been proposed to address this problem, but the focus is generally on achieving the highest overall accuracy. Consequently, many long-tail methods achieve high overall accuracy, but with unbalanced per-class accuracies: frequent classes are learned well, while rare classes are learned poorly. Such models can lead to biased outcomes, which raises serious ethical concerns. In this paper, we proposed AlphaNet, a rapid post hoc correction method that can be applied to any classification model. Our simple method greatly improves the accuracy for data-poor classes, and re-balances per-class classification accuracies while preserving overall accuracy. AlphaNet can be deployed in any application where the base classifiers cannot be changed but balanced performance is desirable, making it useful in contexts where ethics, privacy, or intellectual property are concerns.

Acknowledgements

This paper is based upon work supported by the Department of Defense contract FA8702-15-D-0002, and the NSF Graduate Research Fellowship for Nadine Chang.

References

  1. Jiang CM, Najibi M, Qi CR, Zhou Y, Anguelov D. Improving the intra-class long-tail in 3D detection via rare example mining. In: Computer vision – ECCV 2022. 2022. p. 158–75.
  2. Yang Z, Pan J, Yang Y, Shi X, Zhou H-Y, Zhang Z, et al. ProCo: Prototype-aware contrastive learning for long-tailed medical image classification. In: Medical image computing and computer assisted intervention – MICCAI 2022. 2022. p. 173–82.
  3. Yang L, Jiang H, Song Q, Guo J. A survey on long-tailed visual recognition. International Journal of Computer Vision. 2022;130(7):1837–72.
  4. Cui Y, Jia M, Lin T-Y, Song Y, Belongie S. Class-balanced loss based on effective number of samples. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2019. p. 9260–9.
  5. Cao K, Wei C, Gaidon A, Arechiga N, Ma T. Learning imbalanced datasets with label-distribution-aware margin loss. In: Advances in neural information processing systems. 2019.
  6. Kang B, Xie S, Rohrbach M, Yan Z, Gordo A, Feng J, et al. Decoupling representation and classifier for long-tailed recognition. In: International conference on learning representations. 2020.
  7. Wang X, Lian L, Miao Z, Liu Z, Yu S. Long-tailed recognition by routing diverse distribution-aware experts. In: International conference on learning representations. 2021.
  8. Cai J, Wang Y, Hwang J-N. ACE: Ally complementary experts for solving long-tailed recognition in one-shot. In: 2021 IEEE/CVF international conference on computer vision (ICCV). 2021. p. 112–21.
  9. Li T, Wang L, Wu G. Self supervision to distillation for long-tailed visual recognition. In: 2021 IEEE/CVF international conference on computer vision (ICCV). 2021. p. 610–9.
  10. Alshammari S, Wang Y-X, Ramanan D, Kong S. Long-tailed recognition via weight balancing. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2022. p. 6887–97.
  11. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. 2009. p. 248–55.
  12. Ricci Lara MA, Echeveste R, Ferrante E. Addressing fairness in artificial intelligence for medical imaging. Nature Communications. 2022;13(1):4581.
  13. Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In: Proceedings of the 1st conference on fairness, accountability and transparency. 2018. p. 77–91.
  14. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine. 2021;27(12):2176–82.
  15. Puyol-Antón E, Ruijsink B, Piechnik SK, Neubauer S, Petersen SE, Razavi R, et al. Fairness in cardiac MR image analysis: An investigation of bias due to data imbalance in deep learning based segmentation. In: Medical image computing and computer assisted intervention – MICCAI 2021. 2021. p. 413–23.
  16. Zhang Y, Kang B, Hooi B, Yan S, Feng J. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023;1–20.
  17. Liu Z, Miao Z, Zhan X, Wang J, Gong B, Yu S. Large-scale long-tailed recognition in an open world. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2019. p. 2532–41.
  18. Thrun S, Pratt L, editors. Learning to learn. Springer New York, NY; 2012.
  19. Caruana R. Multitask learning. Machine Learning. 1997;28(1):41–75.
  20. Schmidhuber J, Zhao J, Wiering M. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning. 1997;28(1):105–30.
  21. Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering. 2010;22(10):1345–59.
  22. Wang Y-X, Hebert M. Learning to learn: Model regression networks for easy small sample learning. In: Computer vision – ECCV 2016. 2016. p. 616–34.
  23. Du SS, Koushik J, Singh A, Poczos B. Hypothesis transfer learning via transformation functions. In: Advances in neural information processing systems. 2017.
  24. Fei-Fei L, Fergus R, Perona P. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006;28(4):594–611.
  25. Larochelle H, Erhan D, Bengio Y. Zero-data learning of new tasks. In: Proceedings of the 23rd national conference on artificial intelligence - volume 2. 2008. p. 646–51.
  26. Mensink T, Gavves E, Snoek CGM. COSTA: Co-occurrence statistics for zero-shot classification. In: 2014 IEEE conference on computer vision and pattern recognition. 2014. p. 2441–8.
  27. Changpinyo S, Chao W-L, Gong B, Sha F. Synthesized classifiers for zero-shot learning. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016. p. 5327–36.
  28. Aytar Y, Zisserman A. Part level transfer regularization for enhancing exemplar SVMs. Comput Vis Image Underst. 2015;138(C):114–23.
  29. Schapire RE. The strength of weak learnability. Machine learning. 1990;5(2):197–227.
  30. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 1997;55(1):119–39.
  31. Torralba A, Murphy KP, Freeman WT. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29(5):854–69.
  32. Van Horn G, Mac Aodha O, Song Y, Cui Y, Sun C, Shepard A, et al. The iNaturalist species classification and detection dataset. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 8769–78.
  33. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vision. 2015;115(3):211–52.
  34. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018;40(6):1452–64.
  35. Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto; 2009.
  36. Princeton University. About WordNet. https://wordnet.princeton.edu/; 2010.
  37. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems. 2019.
  38. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016. p. 770–8.
  39. Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). 2017. p. 5987–95.
  40. Loshchilov I, Hutter F. Decoupled weight decay regularization. In: International conference on learning representations. 2019.
  41. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011. p. 315–23.
  42. Hunter JD. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9(3):90–5.
  43. Waskom ML. Seaborn: Statistical data visualization. Journal of Open Source Software. 2021;6(60):3021.

A Implementation details

Experiments were run using the PyTorch37 library. We used the container implementation provided by NVIDIA GPU cloud (NGC).n6 Code to reproduce experimental results is available on GitHub.n7

A.1 Datasets

Table A1: Statistics of long-tailed datasets.

| Dataset | Samples (train / val / test) | Classes: total (many / medium / few) | Train samples (many / medium / few) |
| --- | --- | --- | --- |
| ImageNet‑LT | 115,846 / 20,000 / 50,000 | 1,000 (385 / 479 / 136) | 88,693 / 25,510 / 1,643 |
| Places‑LT | 62,500 / 7,300 / 36,500 | 365 (131 / 163 / 71) | 52,762 / 8,934 / 804 |
| CIFAR‑100‑LT | 10,847 / – / 10,000 | 100 (35 / 35 / 30) | 8,824 / 1,718 / 305 |
| iNaturalist | 437,513 / – / 24,426 | 8,142 (842 / 4,076 / 3,224) | 258,340 / 133,061 / 46,112 |

Details about the long-tailed datasets used in our experiments are shown in Table A1. For ImageNet‑LT and Places‑LT, we used splits from Kang et al.6, available on GitHub.n8 For CIFAR‑100‑LT, we used the implementation of Wang et al.7, with imbalance factor 100, also available on GitHub.n9

A.1.1 Splits

For all datasets, the ‘many’, ‘medium’, and ‘few’ splits are defined using the same limits on per-class training samples: less than 20 for the ‘few’ split, between 20 and 100 for the ‘medium’ split, and more than 100 for the ‘many’ split. The actual minimum and maximum per-class training samples for each split are shown in Table A2.

A.2 Baseline models

Baseline model architectures are shown in Table A3. All models used backbones made of residual networks – ResNets38, and ResNeXts39. Whenever we refer to a model, the architecture corresponding to the dataset is used. For example, cRT used the ResNeXt‑50 architecture on ImageNet‑LT, and the ResNet‑152 architecture on Places‑LT.

For all models except LTR, we used model weights provided by the respective authors. For LTR, we retrained the model using code provided by the authors,n10 with some modifications: (1) for consistency, we used the same CIFAR‑100‑LT data splits used for training the RIDE model, and (2) we performed second stage training – fine-tuning with weight decay and norm thresholding – for a fixed 10 epochs.

Table A2: Minimum and maximum per-class training samples for long-tailed datasets.

| Dataset | Min. per-class samples (many / medium / few) | Max. per-class samples (many / medium / few) |
| --- | --- | --- |
| ImageNet‑LT | 101 / 20 / 5 | 1,280 / 100 / 19 |
| Places‑LT | 103 / 20 / 5 | 4,980 / 100 / 19 |
| CIFAR‑100‑LT | 102 / 20 / 5 | 500 / 98 / 19 |
| iNaturalist | 101 / 20 / 2 | 1,000 / 100 / 19 |

A.2.1 Feature and classifier extraction

Table A3: Baseline model architectures.

| Dataset | Model | Architecture |
| --- | --- | --- |
| ImageNet‑LT | cRT | ResNeXt‑50 |
| ImageNet‑LT | LWS | ResNeXt‑50 |
| ImageNet‑LT | RIDE | ResNeXt‑50 |
| Places‑LT | cRT | ResNet‑152 |
| Places‑LT | LWS | ResNet‑152 |
| CIFAR‑100‑LT | RIDE | ResNet‑32 |
| CIFAR‑100‑LT | LTR | ResNet‑34 |
| iNaturalist | cRT | ResNet‑152 |

In most cases, we simply used the output of a model’s penultimate layer as features, and used the weights (including bias) of the last layer as the classifier. Exceptions are listed below:

  • LWS: We multiplied classifier weights with the learned scales.

  • RIDE: We used the 6-expert teacher model, and saved classifiers from each expert after normalizing and scaling as in the model. For AlphaNet training, we created a single classifier by concatenating the expert classifier weights and biases. Similarly, features extracted from individual experts were concatenated after normalizing. During prediction, the individual experts and features were re-extracted, and each expert was applied to its corresponding features to get 6 sets of predictions, which were then averaged. So, AlphaNet learned coefficients to update all 6 experts simultaneously (a sketch of this procedure follows the list).

  • LTR: We used the model fine-tuned with weight decay and norm thresholding. This creates a classifier with small norm, and correspondingly small prediction scores. So, during AlphaNet training, we multiplied all prediction scores by 100, which is equivalent to setting the softmax temperature to 0.01.
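
The RIDE concatenation described above can be sketched as follows. This is our own reconstruction from the description; in particular, summing the expert biases for the concatenated classifier is an assumption.

```python
import torch

def concat_experts(expert_ws, expert_bs, expert_feats):
    """Build one classifier/feature per class/image by concatenating E experts,
    so a single set of alpha coefficients updates all experts at once.
    Note: the dot product of concatenated vectors is the sum of expert scores.

    expert_ws:    list of E (C, d) classifier matrices (normalized and scaled).
    expert_bs:    list of E (C,) bias vectors.
    expert_feats: list of E (N, d) feature matrices for the same N images."""
    W = torch.cat(expert_ws, dim=1)               # (C, E*d)
    feats = torch.cat(expert_feats, dim=1)        # (N, E*d)
    b = torch.stack(expert_bs, dim=0).sum(dim=0)  # (C,) -- assumption: biases add
    return W, b, feats

def average_expert_scores(expert_ws, expert_bs, expert_feats):
    """At prediction time, apply each expert to its own features and average."""
    scores = [f @ w.t() + b for w, b, f in zip(expert_ws, expert_bs, expert_feats)]
    return torch.stack(scores, dim=0).mean(dim=0)  # (N, C)
```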

A.3 Training

For the main experiments, 5 nearest neighbors were selected for each ‘few’ split class, based on Euclidean distance. Hyper-parameter settings used during training are shown in Table A4. ImageNet‑LT and Places‑LT have a validation set, which was used to select the best model. This was controlled by the ‘minimum epochs’ parameter. After training for at least this many epochs, model weights were saved at the end of each epoch. Finally, the best model was selected based on overall validation accuracy, and used for testing. For CIFAR‑100‑LT and iNaturalist, we simply trained for a fixed number of epochs.

Table A4: Training hyper-parameters for main experiments.

| Parameter | Value |
| --- | --- |
| Optimizer | AdamW40 with default parameters |
| Initial learning rate | 0.001 |
| Learning rate decay | 0.1 every 10 epochs |
| Training epochs | 10 (CIFAR‑100‑LT and iNaturalist); 25 (ImageNet‑LT and Places‑LT) |
| Minimum epochs | 5 |
| Batch size | 256 for iNaturalist, 64 for all others |
| AlphaNet architecture | 3 fully connected layers, each with 32 units |
| Hidden layer activation | Leaky-ReLU41 with negative slope 0.01 |
| Weight initialization | Uniform sampling with bounds ±(1/√m), where m is the number of input units to a layer |

A.4 Results

All experiments were repeated 10 times from different random initializations, and unless specified otherwise, results are average values. In tables, the standard deviation is reported alongside the mean. We regenerated baseline results, and these match published values, except for LTR, which was retrained and, for consistency, evaluated without test-time data augmentation. Plots were generated with Matplotlib42, using the Seaborn library43. Error bars in figures represent 95% confidence intervals, estimated using 10,000 bootstrap resamples.
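
Such intervals can be computed with a standard percentile bootstrap; a minimal sketch (ours, not the actual plotting code used for the figures):

```python
import numpy as np

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```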


B Analysis of nearest neighbor selection

We analyzed the effect of the number of nearest neighbors $k$, and the distance metric $\mu$, on the performance of AlphaNet, using the cRT model on ImageNet‑LT. We compared two distance metrics:

  • Cosine distance: $\mu(z_1, z_2) = 1 - z_1^T z_2$.
  • Euclidean distance: $\mu(z_1, z_2) = \left\|z_1 - z_2\right\|_2$.

For each distance metric, we performed 4 sets of experiments, with $\rho$ in $\{0.25, 0.5, 1, 2\}$. For each $\rho$, we varied $k$ from 2 to 10; all other hyper-parameters were kept the same as described in Appendix A.

The results are summarized in Figure B1, which shows per-split top‑1 accuracies against $k$ for different values of $\rho$ ($\rho=2$ is omitted from this figure for space; no special behavior was observed for this case). We observe little change in performance beyond $k=5$, and also observe similar performance for both distance metrics.

The full set of top‑1 and top‑5 accuracies is shown in the following tables:

(a)
(b)
(c)
Figure B1: Per-split test accuracies for α‑cRT on ImageNet‑LT versus the number of nearest neighbors $k$. (a) $\rho=0.25$ (b) $\rho=0.5$ (c) $\rho=1$
Table B1: Per-split test top‑1 accuracies for α‑cRT on ImageNet‑LT using k nearest neighbors based on Euclidean distance.
ModelFewMed.ManyOverall
cRT27.446.261.849.6
α‑cRT
ρ=0.25
k=128.729.0629.418.3643.519.7434.712.44
k=247.207.2430.911.4546.113.0838.909.54
k=343.002.4039.300.6355.800.4846.100.37
k=442.302.3640.001.0056.500.7946.700.51
k=543.201.5840.300.5756.800.4847.000.38
k=643.001.8140.701.0157.100.8447.300.56
k=743.501.9540.600.8157.000.6747.300.43
k=843.601.4740.700.5857.200.5047.500.34
k=942.602.2641.100.8857.500.7647.600.43
k=1042.801.0141.400.5857.700.4147.900.32
ρ=0.5
k=117.526.7636.516.9251.218.2539.611.49
k=242.809.5333.312.7748.414.2240.410.30
k=338.902.0641.000.4557.200.6046.900.40
k=439.102.5341.600.9057.900.6447.500.56
k=539.101.6442.500.5258.600.3948.200.19
k=639.402.0542.200.5158.500.4448.100.21
k=739.701.2142.400.5158.600.4448.300.29
k=838.301.5443.000.5559.000.3948.500.26
k=940.000.9942.700.3958.900.3348.500.21
k=1039.001.6443.000.5159.100.4148.700.21
ρ=1
k=134.128.6126.018.0939.919.5132.512.28
k=237.708.9337.610.6253.111.5643.608.36
k=335.401.4842.900.5059.000.4648.100.25
k=435.202.0243.400.5359.400.4648.400.30
k=535.601.9343.600.5959.600.4148.700.23
k=635.801.2343.600.3959.600.3248.700.24
k=736.101.3043.800.4559.700.3048.900.17
k=836.501.9043.700.4359.700.3648.900.16
k=936.101.8043.800.4559.800.4048.900.15
k=1035.602.0144.000.5660.000.4849.000.22
ρ=2
k=123.028.6133.018.0947.419.5137.212.28
k=229.901.8243.800.4559.600.4648.000.16
k=330.901.8644.200.3560.100.2948.500.16
k=431.202.0544.500.4060.300.3948.800.21
k=533.101.4544.300.3160.200.3248.900.15
k=632.001.6844.700.3460.500.2149.100.07
k=732.301.3844.800.3160.600.2549.200.14
k=832.201.7044.800.3860.600.2849.200.11
k=932.401.2244.800.2960.600.1849.200.07
k=1032.401.8544.900.4760.700.3749.300.14
Table B2: Per-split test top‑5 accuracies for α‑cRT on ImageNet‑LT using k nearest neighbors based on Euclidean distance.
ModelFewMed.ManyOverall
cRT57.373.481.874.4
α‑cRT
ρ=0.25
k=142.541.1263.711.2874.88.1165.12.93
k=271.407.4465.507.0976.15.0470.44.33
k=367.301.2670.400.3279.60.2173.50.22
k=467.801.3970.500.3579.70.2673.70.14
k=568.200.8970.700.3979.80.2473.90.23
k=668.101.2870.800.4679.90.3273.90.17
k=768.900.8670.700.3779.80.2673.90.19
k=868.700.7170.800.2579.80.1874.00.13
k=968.401.2770.900.3479.90.2474.00.13
k=1068.500.6971.000.3379.90.2174.10.16
ρ=0.5
k=126.637.9068.010.3777.97.4566.22.68
k=268.109.4266.607.3676.95.2270.84.26
k=364.701.6270.900.3180.00.2673.60.18
k=465.002.0671.200.5180.10.3073.80.32
k=565.300.7871.600.1780.40.1474.10.08
k=665.901.0571.300.2180.30.1874.00.12
k=766.200.9371.500.3480.30.2474.10.16
k=865.400.9471.600.3080.40.2074.10.13
k=966.300.7171.600.1780.40.1374.20.09
k=1066.000.9071.600.2580.50.1674.30.10
ρ=1
k=150.140.5261.611.0873.37.9764.52.87
k=263.408.2469.005.9578.64.2071.93.37
k=361.901.3271.800.3280.60.2673.80.15
k=462.001.3672.000.2380.70.1774.00.15
k=562.701.2572.000.2780.80.2074.10.11
k=663.200.8372.000.2280.80.1874.20.13
k=763.501.0772.000.2580.70.1574.20.09
k=864.201.0672.000.2380.70.1574.30.14
k=963.701.2672.100.2780.80.1774.30.07
k=1063.601.5172.100.3480.80.2174.30.08
ρ=2
k=134.440.5265.911.0876.47.9765.62.87
k=256.801.3872.200.3080.90.2373.50.16
k=357.801.4872.400.1981.10.2073.80.17
k=458.801.3272.400.2581.10.1973.90.18
k=560.901.1572.300.2780.90.1874.10.12
k=660.001.3172.600.1781.20.1374.20.09
k=760.500.9772.600.1981.10.1574.20.09
k=860.601.1572.600.2181.20.2074.30.06
k=960.900.9972.600.1981.20.1374.30.04
k=1060.901.3472.600.2381.20.1674.30.08
Table B3: Per-split test top‑1 accuracies for α‑cRT on ImageNet‑LT using k nearest neighbors based on cosine distance.
ModelFewMed.ManyOverall
cRT27.446.261.849.6
α‑cRT
ρ=0.25
k=117.426.9936.516.9751.318.1539.611.45
k=244.306.7134.110.3450.111.5741.608.50
k=344.305.1936.108.0852.209.0643.406.69
k=442.302.1239.800.7456.400.7046.500.52
k=542.702.8140.601.0456.900.9347.200.52
k=644.301.8940.000.9856.500.8146.900.54
k=743.201.6040.600.9757.100.7747.300.55
k=843.302.3140.701.0157.100.8647.400.52
k=942.901.2341.000.7257.500.6247.600.45
k=1043.201.5440.700.5957.200.4447.400.32
ρ=0.5
k=128.629.4429.518.5143.819.8034.912.49
k=239.506.7938.208.6554.209.5344.506.91
k=338.801.7841.200.8257.500.7847.100.49
k=438.801.8841.400.3857.800.4447.400.30
k=539.102.4742.000.7258.200.6247.800.33
k=639.401.4542.600.6758.700.4448.400.31
k=740.301.1942.400.4758.500.4348.300.28
k=840.701.3542.200.6058.400.4248.300.34
k=939.801.0842.600.4058.800.3148.500.22
k=1039.901.1742.800.4958.900.3148.600.24
ρ=1
k=139.826.9922.516.9736.218.1530.111.45
k=237.009.4437.411.5853.112.3443.409.03
k=333.801.4943.200.5259.200.4848.100.25
k=435.901.1043.300.3859.200.2448.400.27
k=534.701.9943.400.5259.500.3948.400.22
k=635.501.7943.900.4159.800.4248.900.14
k=737.101.7243.400.4859.400.4348.700.22
k=836.101.4743.800.4059.800.3148.900.15
k=935.601.3744.000.3260.000.3049.000.14
k=1035.401.6844.000.5160.000.3349.000.15
ρ=2
k=123.028.8533.018.1447.519.4037.212.24
k=229.101.6444.000.3659.900.2548.100.15
k=330.901.8944.100.5960.000.4548.400.23
k=431.602.1344.300.4960.300.3848.700.20
k=532.502.4044.500.5760.300.4148.900.17
k=630.801.7644.900.3560.700.3149.000.17
k=732.401.8544.800.3460.600.3049.200.12
k=831.501.5245.000.2760.700.2149.200.13
k=932.901.4144.800.2560.600.2249.300.10
k=1031.902.1645.000.4260.700.3349.300.08
Table B4: Per-split test top‑5 accuracies for α‑cRT on ImageNet‑LT using k nearest neighbors based on cosine distance.
ModelFewMed.ManyOverall
cRT57.373.481.874.4
α‑cRT
ρ=0.25
k=126.338.1468.010.3778.07.4666.22.65
k=268.807.1267.506.2877.64.4671.63.76
k=369.005.0368.704.7278.43.2172.52.84
k=467.401.5070.500.2979.80.2073.70.20
k=568.001.8370.800.3979.90.2673.90.14
k=669.101.0970.600.3779.70.2473.90.15
k=768.701.3070.700.4379.80.3174.00.18
k=868.901.4570.700.3279.80.2573.90.09
k=968.700.8270.800.3579.90.2774.00.18
k=1069.000.6970.600.3079.70.1973.90.21
ρ=0.5
k=142.141.6163.711.3274.98.1365.12.89
k=264.506.5969.505.1279.03.6072.52.96
k=364.201.9571.200.4880.10.3373.70.17
k=465.501.1071.100.2480.10.1573.80.18
k=565.501.5571.300.3180.20.2173.90.13
k=665.400.9571.700.2680.50.1774.20.11
k=766.400.8271.500.2580.30.1674.20.13
k=866.800.8871.400.2880.20.2074.20.17
k=966.300.9371.600.2880.40.1974.30.13
k=1066.400.6871.600.1980.40.1474.30.12
ρ=1
k=157.938.1459.410.3771.87.4664.02.65
k=263.209.2868.906.3578.64.4371.93.50
k=360.601.4471.900.3480.60.2673.70.12
k=462.201.0272.000.3280.70.1574.00.17
k=562.401.4871.800.2580.70.1674.00.12
k=662.801.1072.200.2080.80.1474.20.09
k=764.101.3571.900.2280.60.1574.20.11
k=863.501.0372.100.2080.80.1474.30.09
k=963.401.0672.100.2380.80.1274.30.12
k=1063.301.2272.200.2680.90.1674.30.07
ρ=2
k=134.240.7765.911.0976.47.9765.62.83
k=255.701.4072.400.1381.00.0973.40.18
k=358.101.8372.300.3681.00.2373.70.12
k=459.401.5272.400.2681.00.1973.90.17
k=560.201.7072.500.2581.10.1974.20.12
k=659.301.5272.600.3181.20.1674.10.15
k=760.501.5072.600.1881.20.1574.30.10
k=860.001.2472.600.1981.20.1174.20.16
k=961.000.9072.600.1481.10.1074.30.09
k=1060.501.5272.700.2881.20.1674.30.04

C Analysis of training data sampling

This section contains results for AlphaNet training with a range of $\rho$ values. Training was performed following the same procedure as described in Appendix A. We also include results for the iNaturalist dataset using the cRT model. For iNaturalist, we used smaller values of $\rho$, given the much smaller differences in per-split accuracy.

The results are summarized in Figures C1 and C2, which show the change in per-split top‑1 and top‑5 accuracy, respectively, versus $\rho$ (iNaturalist results are omitted from these figures due to the different set of $\rho$ values used).

Detailed results, organized by dataset, are shown in the following tables:

In addition to top‑1 and top‑5 accuracy, we evaluated performance on ImageNet‑LT by considering predictions to a WordNet36 nearest neighbor as correct.n11 Given a level $l$, if the predicted class for a sample is within $l$ nodes of the true class in the WordNet hierarchy (using the shortest path), it is considered correct. We used $l=4$, and these results are shown in Table C3.
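
A sketch of this metric using NLTK's WordNet interface is shown below; mapping ImageNet classes to synsets through their WordNet IDs ('n' followed by the 8-digit synset offset) is our assumption about the wiring, not a description of the evaluation code used here.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wordnet_correct(pred_wnid, true_wnid, level=4):
    """Count a prediction as correct if the predicted synset is within `level`
    nodes of the true synset in WordNet (shortest path between them)."""
    pred = wn.synset_from_pos_and_offset('n', int(pred_wnid[1:]))
    true = wn.synset_from_pos_and_offset('n', int(true_wnid[1:]))
    dist = pred.shortest_path_distance(true)
    return dist is not None and dist <= level
```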

Figure C1: Change in per-split top‑1 accuracy vs. $\rho$ for AlphaNet training with different models and datasets.
Figure C2: Change in per-split top‑5 accuracy vs. $\rho$ for AlphaNet training with different models and datasets.
Table C1: Top‑1 accuracy on ImageNet‑LT, using AlphaNet applied to different models.
ModelFewMed.ManyOverall
cRT27.446.261.849.6
α‑cRT
ρ=0.147.61.6037.10.7653.80.6245.00.41
ρ=0.245.71.5639.00.7855.70.7946.30.52
ρ=0.342.91.2940.40.7257.00.6847.10.47
ρ=0.440.81.8541.50.7257.80.5447.70.36
ρ=0.539.71.4242.00.6658.30.5248.00.37
ρ=0.7537.41.9342.90.4659.00.4048.30.16
ρ=134.61.8843.70.5159.70.4348.60.24
ρ=1.2535.01.9843.60.7059.60.5048.60.35
ρ=1.532.62.4644.40.4960.30.3848.90.19
ρ=1.7532.31.4244.40.3260.30.1848.90.14
ρ=231.51.9944.70.4660.50.3049.00.12
ρ=329.02.0545.10.3660.90.2849.00.08
LWS30.447.260.249.9
α‑LWS
ρ=0.153.90.7729.41.2244.21.2238.51.03
ρ=0.252.01.2133.31.4448.01.3741.51.08
ρ=0.349.82.2035.91.5950.41.4743.41.10
ρ=0.448.71.1737.41.1351.60.9544.40.80
ρ=0.546.90.9838.60.8752.90.8645.30.69
ρ=0.7545.31.8940.21.2854.41.0146.30.76
ρ=141.61.6142.20.5356.00.3247.40.30
ρ=1.2542.51.4142.10.7456.00.5347.50.41
ρ=1.540.11.9943.20.9856.90.7648.00.53
ρ=1.7539.42.5343.50.8857.10.7048.20.45
ρ=237.52.9444.30.8657.90.7248.60.32
ρ=334.51.9145.30.5458.70.3749.00.21
RIDE36.554.468.957.5
α‑RIDE
ρ=0.149.20.6949.50.4065.40.1655.60.19
ρ=0.247.10.8650.50.3466.00.2056.00.16
ρ=0.346.11.1651.20.5066.50.2956.40.21
ρ=0.444.71.1351.70.3966.90.2456.60.22
ρ=0.543.50.7552.30.2667.30.1756.90.11
ρ=0.7541.70.6352.80.1867.70.1557.00.12
ρ=140.81.0053.10.2167.90.1857.10.11
ρ=1.2539.01.2853.40.2268.20.1657.20.05
ρ=1.538.21.2253.60.2568.40.1757.20.06
ρ=1.7537.70.8953.70.1468.40.1057.20.05
ρ=237.10.9853.80.1268.50.1157.20.06
ρ=334.51.2354.30.1468.80.1157.20.08
Table C2: Top‑5 accuracy on ImageNet‑LT, using AlphaNet applied to different models.
ModelFewMed.ManyOverall
cRT57.373.481.874.4
α‑cRT
ρ=0.171.70.8369.40.2578.90.1673.40.11
ρ=0.269.60.9370.30.4179.50.2773.70.21
ρ=0.367.91.0370.80.4579.80.3173.90.21
ρ=0.466.41.2771.10.3680.10.2373.90.19
ρ=0.565.71.0871.40.3980.30.2574.00.16
ρ=0.7564.41.0871.60.2180.50.1474.10.10
ρ=162.41.7171.90.2980.70.2074.00.16
ρ=1.2562.41.2672.00.3780.70.2674.10.18
ρ=1.560.41.6772.40.2481.00.1874.10.17
ρ=1.7560.40.8172.30.1981.00.1174.00.13
ρ=259.51.3072.60.2581.10.1974.10.14
ρ=357.41.4572.90.2181.40.1774.00.11
LWS61.573.781.675.1
α‑LWS
ρ=0.177.90.6866.50.5975.70.6571.60.50
ρ=0.275.70.9068.20.7377.10.6772.70.50
ρ=0.374.11.4669.20.6777.90.6073.20.42
ρ=0.473.20.6269.60.4978.30.3973.50.35
ρ=0.572.10.9070.10.4978.70.4673.70.32
ρ=0.7570.71.3570.80.4879.30.4174.00.22
ρ=168.40.7671.40.3479.80.2374.20.21
ρ=1.2568.71.0471.50.3579.80.3074.30.19
ρ=1.567.31.1471.90.4580.10.3674.40.23
ρ=1.7566.71.4872.10.3880.30.2974.50.21
ρ=265.01.7872.50.3280.60.2474.60.15
ρ=363.20.7872.80.2380.80.1574.60.18
RIDE67.979.485.080.0
α‑RIDE
ρ=0.178.60.4874.40.2681.70.1677.80.13
ρ=0.276.80.8275.10.2782.20.2278.10.14
ρ=0.375.81.1175.90.5282.60.3378.40.23
ρ=0.474.70.8976.30.3582.90.2478.60.20
ρ=0.573.70.5476.90.1083.30.0778.90.09
ρ=0.7572.10.5677.40.2583.60.1879.10.14
ρ=171.21.1677.70.3083.80.2479.20.17
ρ=1.2569.21.3578.20.2384.20.1379.30.12
ρ=1.568.61.4178.50.2484.40.1779.40.11
ρ=1.7568.10.9978.60.1584.40.1179.40.10
ρ=267.11.0578.80.1684.60.1279.40.08
ρ=363.71.4579.30.1884.90.0979.30.10
Table C3: ImageNet‑LT accuracy computed by considering predictions within 4 WordNet nodes as correct, for AlphaNet applied to different models.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| cRT | 46.5 | 59.7 | 71.0 | 62.3 |
| α‑cRT | | | | |
| ρ=0.1 | 57.8 ±1.02 | 53.5 ±0.48 | 65.7 ±0.39 | 58.8 ±0.27 |
| ρ=0.2 | 56.6 ±0.82 | 54.9 ±0.51 | 66.7 ±0.53 | 59.7 ±0.38 |
| ρ=0.3 | 55.0 ±0.74 | 55.8 ±0.65 | 67.8 ±0.40 | 60.3 ±0.39 |
| ρ=0.4 | 53.9 ±1.02 | 56.5 ±0.55 | 68.3 ±0.40 | 60.7 ±0.31 |
| ρ=0.5 | 53.2 ±0.72 | 56.9 ±0.49 | 68.7 ±0.32 | 60.9 ±0.29 |
| ρ=0.75 | 51.8 ±1.21 | 57.3 ±0.35 | 69.1 ±0.29 | 61.1 ±0.15 |
| ρ=1 | 50.4 ±1.06 | 57.7 ±0.35 | 69.6 ±0.27 | 61.3 ±0.18 |
| ρ=1.25 | 50.6 ±1.09 | 57.8 ±0.49 | 69.6 ±0.34 | 61.3 ±0.27 |
| ρ=1.5 | 49.2 ±1.39 | 58.4 ±0.28 | 70.0 ±0.25 | 61.6 ±0.13 |
| ρ=1.75 | 49.0 ±0.81 | 58.4 ±0.17 | 70.0 ±0.13 | 61.6 ±0.11 |
| ρ=2 | 48.5 ±1.06 | 58.7 ±0.33 | 70.1 ±0.18 | 61.7 ±0.10 |
| ρ=3 | 47.0 ±1.18 | 58.9 ±0.23 | 70.4 ±0.18 | 61.7 ±0.06 |
| LWS | 48.3 | 60.4 | 69.8 | 62.4 |
| α‑LWS | | | | |
| ρ=0.1 | 61.5 ±0.55 | 47.7 ±1.08 | 58.9 ±0.62 | 53.9 ±0.74 |
| ρ=0.2 | 60.4 ±0.94 | 50.6 ±1.18 | 61.5 ±0.89 | 56.1 ±0.80 |
| ρ=0.3 | 59.1 ±1.18 | 52.5 ±1.05 | 63.2 ±0.99 | 57.5 ±0.74 |
| ρ=0.4 | 58.4 ±0.58 | 53.7 ±0.73 | 64.1 ±0.60 | 58.3 ±0.52 |
| ρ=0.5 | 57.5 ±0.60 | 54.4 ±0.91 | 64.8 ±0.62 | 58.8 ±0.64 |
| ρ=0.75 | 56.5 ±1.06 | 55.6 ±1.00 | 65.9 ±0.66 | 59.7 ±0.59 |
| ρ=1 | 54.4 ±0.89 | 56.9 ±0.36 | 67.0 ±0.18 | 60.5 ±0.23 |
| ρ=1.25 | 55.0 ±0.85 | 56.8 ±0.59 | 67.0 ±0.38 | 60.5 ±0.33 |
| ρ=1.5 | 53.6 ±1.16 | 57.8 ±0.76 | 67.5 ±0.49 | 60.9 ±0.41 |
| ρ=1.75 | 53.1 ±1.31 | 57.9 ±0.59 | 67.8 ±0.45 | 61.1 ±0.32 |
| ρ=2 | 52.1 ±1.36 | 58.5 ±0.45 | 68.3 ±0.48 | 61.4 ±0.24 |
| ρ=3 | 50.7 ±0.80 | 59.1 ±0.29 | 68.8 ±0.23 | 61.7 ±0.15 |
| RIDE | 53.8 | 66.4 | 76.3 | 68.5 |
| α‑RIDE | | | | |
| ρ=0.1 | 61.6 ±0.46 | 62.9 ±0.30 | 73.7 ±0.11 | 66.9 ±0.17 |
| ρ=0.2 | 60.4 ±0.48 | 63.7 ±0.24 | 74.3 ±0.19 | 67.3 ±0.14 |
| ρ=0.3 | 59.7 ±0.76 | 64.2 ±0.40 | 74.6 ±0.27 | 67.6 ±0.21 |
| ρ=0.4 | 58.9 ±0.70 | 64.5 ±0.28 | 74.9 ±0.22 | 67.8 ±0.20 |
| ρ=0.5 | 58.2 ±0.39 | 65.0 ±0.13 | 75.2 ±0.09 | 68.0 ±0.08 |
| ρ=0.75 | 57.0 ±0.39 | 65.4 ±0.18 | 75.5 ±0.13 | 68.1 ±0.13 |
| ρ=1 | 56.5 ±0.74 | 65.5 ±0.17 | 75.7 ±0.14 | 68.2 ±0.11 |
| ρ=1.25 | 55.3 ±0.90 | 65.8 ±0.14 | 75.9 ±0.12 | 68.2 ±0.07 |
| ρ=1.5 | 54.6 ±0.88 | 65.9 ±0.15 | 76.0 ±0.12 | 68.3 ±0.04 |
| ρ=1.75 | 54.5 ±0.62 | 66.0 ±0.08 | 76.0 ±0.06 | 68.3 ±0.06 |
| ρ=2 | 54.1 ±0.70 | 66.0 ±0.10 | 76.1 ±0.11 | 68.3 ±0.04 |
| ρ=3 | 52.2 ±0.76 | 66.3 ±0.07 | 76.3 ±0.09 | 68.2 ±0.06 |
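The relaxed metric in Table C3 counts a prediction as correct if the predicted class is within 4 WordNet nodes of the true class. The sketch below is one way such a metric could be computed with NLTK's WordNet interface; it is an illustration rather than the implementation used for the table, and the `synset_from_wnid` helper and the use of shortest-path distance are assumptions (ImageNet class IDs are WordNet noun synset offsets).

```python
# Illustrative sketch (not the paper's code): a prediction counts as correct if
# the predicted class's synset is within k nodes of the true class's synset.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def synset_from_wnid(wnid):
    # Hypothetical helper: map a WordNet ID ("n" followed by a synset offset)
    # to the corresponding noun synset.
    return wn.synset_from_pos_and_offset(wnid[0], int(wnid[1:]))


def relaxed_accuracy(pred_wnids, true_wnids, k=4):
    # Fraction of predictions whose shortest-path distance in the WordNet
    # graph to the true synset is at most k.
    hits = 0
    for pred, true in zip(pred_wnids, true_wnids):
        dist = synset_from_wnid(pred).shortest_path_distance(synset_from_wnid(true))
        hits += int(dist is not None and dist <= k)
    return hits / len(true_wnids)
```

Since a synset has distance 0 to itself, setting `k=0` recovers standard top‑1 accuracy under this definition.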
Table C4: Top‑1 accuracy on Places‑LT, using AlphaNet applied to different models.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| cRT | 24.9 | 37.6 | 42.0 | 36.7 |
| α‑cRT | | | | |
| ρ=0.1 | 38.2 ±1.30 | 30.4 ±0.84 | 37.3 ±0.72 | 34.4 ±0.41 |
| ρ=0.2 | 34.8 ±1.42 | 32.4 ±0.66 | 39.0 ±0.29 | 35.3 ±0.15 |
| ρ=0.3 | 33.2 ±1.57 | 33.4 ±0.75 | 39.5 ±0.60 | 35.6 ±0.25 |
| ρ=0.4 | 32.4 ±1.47 | 33.8 ±0.69 | 39.8 ±0.46 | 35.7 ±0.22 |
| ρ=0.5 | 31.0 ±0.88 | 34.5 ±0.17 | 40.4 ±0.29 | 35.9 ±0.09 |
| ρ=0.75 | 28.6 ±1.27 | 35.5 ±0.40 | 41.0 ±0.13 | 36.1 ±0.11 |
| ρ=1 | 27.0 ±1.02 | 36.1 ±0.31 | 41.3 ±0.13 | 36.2 ±0.10 |
| ρ=1.25 | 26.9 ±1.23 | 36.0 ±0.47 | 41.3 ±0.14 | 36.1 ±0.16 |
| ρ=1.5 | 25.5 ±0.89 | 36.5 ±0.36 | 41.6 ±0.21 | 36.2 ±0.11 |
| ρ=1.75 | 25.4 ±1.32 | 36.5 ±0.29 | 41.5 ±0.23 | 36.2 ±0.11 |
| ρ=2 | 24.9 ±1.38 | 36.6 ±0.49 | 41.5 ±0.20 | 36.1 ±0.10 |
| ρ=3 | 22.3 ±1.56 | 37.3 ±0.28 | 41.9 ±0.20 | 36.1 ±0.14 |
| LWS | 28.7 | 39.1 | 40.6 | 37.6 |
| α‑LWS | | | | |
| ρ=0.1 | 42.5 ±1.03 | 29.5 ±1.01 | 34.4 ±0.94 | 33.8 ±0.58 |
| ρ=0.2 | 41.3 ±1.30 | 31.0 ±0.81 | 35.5 ±0.81 | 34.6 ±0.44 |
| ρ=0.3 | 38.7 ±1.13 | 33.2 ±1.12 | 37.0 ±0.70 | 35.6 ±0.51 |
| ρ=0.4 | 38.0 ±1.45 | 33.6 ±1.18 | 37.3 ±0.65 | 35.8 ±0.50 |
| ρ=0.5 | 37.1 ±1.39 | 34.4 ±0.80 | 37.7 ±0.52 | 36.1 ±0.31 |
| ρ=0.75 | 35.5 ±1.25 | 35.3 ±0.59 | 38.5 ±0.28 | 36.5 ±0.16 |
| ρ=1 | 34.6 ±0.97 | 35.8 ±0.54 | 38.6 ±0.39 | 36.6 ±0.22 |
| ρ=1.25 | 32.6 ±1.28 | 36.8 ±0.60 | 39.3 ±0.39 | 36.9 ±0.19 |
| ρ=1.5 | 32.2 ±1.17 | 37.2 ±0.36 | 39.5 ±0.39 | 37.0 ±0.11 |
| ρ=1.75 | 31.7 ±1.35 | 37.3 ±0.43 | 39.5 ±0.27 | 37.0 ±0.09 |
| ρ=2 | 30.9 ±1.28 | 37.6 ±0.40 | 39.8 ±0.18 | 37.1 ±0.14 |
| ρ=3 | 27.8 ±1.94 | 38.5 ±0.51 | 40.2 ±0.29 | 37.1 ±0.10 |
Table C5: Top‑5 accuracy on Places‑LT, using AlphaNet applied to different models.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| cRT | 56.3 | 70.2 | 74.0 | 68.9 |
| α‑cRT | | | | |
| ρ=0.1 | 67.1 ±1.22 | 64.9 ±0.69 | 71.5 ±0.30 | 67.7 ±0.32 |
| ρ=0.2 | 64.2 ±1.61 | 66.4 ±0.74 | 72.3 ±0.35 | 68.1 ±0.25 |
| ρ=0.3 | 62.7 ±1.32 | 67.1 ±0.63 | 72.6 ±0.34 | 68.2 ±0.24 |
| ρ=0.4 | 62.2 ±1.06 | 67.4 ±0.55 | 72.7 ±0.35 | 68.3 ±0.23 |
| ρ=0.5 | 60.8 ±0.60 | 68.0 ±0.19 | 73.0 ±0.15 | 68.4 ±0.19 |
| ρ=0.75 | 58.9 ±1.07 | 68.6 ±0.34 | 73.3 ±0.20 | 68.4 ±0.21 |
| ρ=1 | 57.1 ±1.29 | 69.2 ±0.30 | 73.5 ±0.16 | 68.4 ±0.19 |
| ρ=1.25 | 57.2 ±0.94 | 69.0 ±0.29 | 73.5 ±0.19 | 68.3 ±0.18 |
| ρ=1.5 | 55.7 ±0.85 | 69.4 ±0.29 | 73.6 ±0.26 | 68.2 ±0.22 |
| ρ=1.75 | 55.8 ±1.09 | 69.4 ±0.25 | 73.7 ±0.15 | 68.3 ±0.09 |
| ρ=2 | 55.2 ±1.49 | 69.5 ±0.38 | 73.6 ±0.24 | 68.2 ±0.20 |
| ρ=3 | 52.4 ±1.83 | 70.1 ±0.35 | 73.9 ±0.15 | 68.0 ±0.25 |
| LWS | 60.2 | 70.8 | 73.4 | 69.7 |
| α‑LWS | | | | |
| ρ=0.1 | 70.5 ±0.94 | 64.0 ±1.02 | 70.2 ±0.44 | 67.5 ±0.43 |
| ρ=0.2 | 69.8 ±0.76 | 64.8 ±0.69 | 70.7 ±0.39 | 67.9 ±0.33 |
| ρ=0.3 | 67.7 ±0.86 | 66.4 ±0.94 | 71.4 ±0.41 | 68.4 ±0.43 |
| ρ=0.4 | 66.9 ±1.74 | 66.8 ±1.15 | 71.6 ±0.60 | 68.6 ±0.41 |
| ρ=0.5 | 66.4 ±1.10 | 67.2 ±0.70 | 71.7 ±0.41 | 68.6 ±0.34 |
| ρ=0.75 | 64.6 ±0.81 | 68.2 ±0.47 | 72.2 ±0.27 | 68.9 ±0.20 |
| ρ=1 | 63.8 ±1.20 | 68.5 ±0.48 | 72.3 ±0.34 | 69.0 ±0.21 |
| ρ=1.25 | 62.5 ±1.31 | 69.1 ±0.49 | 72.6 ±0.28 | 69.1 ±0.13 |
| ρ=1.5 | 62.1 ±1.04 | 69.4 ±0.25 | 72.8 ±0.17 | 69.2 ±0.16 |
| ρ=1.75 | 61.6 ±1.17 | 69.5 ±0.44 | 72.8 ±0.24 | 69.2 ±0.20 |
| ρ=2 | 61.0 ±1.33 | 69.6 ±0.44 | 72.8 ±0.23 | 69.1 ±0.15 |
| ρ=3 | 58.3 ±2.15 | 70.3 ±0.46 | 73.2 ±0.26 | 69.0 ±0.23 |
Table C6: Top‑1 accuracy on CIFAR‑100‑LT, using AlphaNet applied to different models.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| RIDE | 25.8 | 52.1 | 69.3 | 50.2 |
| α‑RIDE | | | | |
| ρ=0.1 | 37.6 ±1.64 | 39.4 ±0.99 | 57.8 ±1.40 | 45.3 ±0.40 |
| ρ=0.2 | 34.0 ±1.42 | 43.2 ±1.01 | 62.1 ±0.64 | 47.1 ±0.58 |
| ρ=0.3 | 33.9 ±1.02 | 44.2 ±1.37 | 62.9 ±1.40 | 47.7 ±0.87 |
| ρ=0.4 | 32.1 ±1.47 | 45.5 ±0.74 | 64.4 ±0.83 | 48.1 ±0.28 |
| ρ=0.5 | 32.3 ±1.24 | 45.9 ±0.87 | 64.6 ±0.78 | 48.4 ±0.43 |
| ρ=0.75 | 28.4 ±1.56 | 48.3 ±0.58 | 66.8 ±0.37 | 48.8 ±0.33 |
| ρ=1 | 27.6 ±1.41 | 49.5 ±0.83 | 67.4 ±0.70 | 49.2 ±0.16 |
| ρ=1.25 | 26.1 ±0.98 | 49.5 ±0.53 | 67.8 ±0.21 | 48.9 ±0.35 |
| ρ=1.5 | 25.2 ±1.11 | 50.2 ±0.57 | 68.3 ±0.34 | 49.0 ±0.26 |
| ρ=1.75 | 24.7 ±1.56 | 50.9 ±0.62 | 68.7 ±0.49 | 49.3 ±0.19 |
| ρ=2 | 24.4 ±1.43 | 50.8 ±0.73 | 68.7 ±0.54 | 49.2 ±0.23 |
| ρ=3 | 22.1 ±1.30 | 51.7 ±0.68 | 69.3 ±0.44 | 49.0 ±0.30 |
| LTR | 29.8 | 49.3 | 70.1 | 50.7 |
| α‑LTR | | | | |
| ρ=0.1 | 38.2 ±1.42 | 33.5 ±2.91 | 62.9 ±1.87 | 45.2 ±1.10 |
| ρ=0.2 | 38.1 ±0.78 | 36.2 ±1.52 | 64.1 ±1.53 | 46.5 ±0.69 |
| ρ=0.3 | 37.2 ±1.44 | 36.2 ±2.45 | 65.0 ±0.93 | 46.6 ±1.07 |
| ρ=0.4 | 36.1 ±1.90 | 38.6 ±0.80 | 64.7 ±2.13 | 47.0 ±0.76 |
| ρ=0.5 | 36.2 ±1.39 | 39.5 ±2.11 | 63.4 ±4.18 | 46.9 ±1.95 |
| ρ=0.75 | 34.3 ±1.59 | 42.5 ±0.59 | 66.0 ±1.78 | 48.2 ±0.43 |
| ρ=1 | 32.2 ±1.77 | 43.4 ±1.22 | 67.5 ±0.95 | 48.5 ±0.73 |
| ρ=1.25 | 31.2 ±1.70 | 46.0 ±0.67 | 66.6 ±3.91 | 48.8 ±1.11 |
| ρ=1.5 | 32.0 ±1.49 | 46.2 ±0.86 | 67.3 ±1.02 | 49.3 ±0.33 |
| ρ=1.75 | 30.5 ±2.05 | 46.4 ±1.00 | 67.8 ±0.75 | 49.1 ±0.34 |
| ρ=2 | 30.8 ±1.72 | 47.5 ±1.00 | 68.1 ±1.22 | 49.7 ±0.40 |
| ρ=3 | 29.9 ±1.72 | 49.0 ±0.86 | 68.6 ±1.01 | 50.1 ±0.20 |
Table C7: Top‑5 accuracy on CIFAR‑100‑LT, using AlphaNet applied to different models.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| RIDE | 68.8 | 80.6 | 86.3 | 79.1 |
| α‑RIDE | | | | |
| ρ=0.1 | 75.5 ±1.34 | 67.6 ±2.59 | 80.0 ±1.38 | 74.3 ±1.33 |
| ρ=0.2 | 72.8 ±1.17 | 72.4 ±2.56 | 82.5 ±1.15 | 76.1 ±1.20 |
| ρ=0.3 | 72.5 ±1.11 | 74.8 ±1.67 | 83.7 ±0.71 | 77.2 ±0.63 |
| ρ=0.4 | 70.8 ±1.24 | 74.9 ±1.03 | 84.0 ±0.60 | 76.8 ±0.50 |
| ρ=0.5 | 71.0 ±1.25 | 75.4 ±1.58 | 84.0 ±0.72 | 77.1 ±0.77 |
| ρ=0.75 | 68.3 ±1.40 | 77.9 ±0.91 | 85.2 ±0.24 | 77.6 ±0.57 |
| ρ=1 | 66.9 ±1.14 | 79.5 ±0.53 | 86.0 ±0.29 | 78.0 ±0.20 |
| ρ=1.25 | 65.2 ±0.96 | 79.3 ±0.79 | 85.9 ±0.33 | 77.3 ±0.55 |
| ρ=1.5 | 65.2 ±1.55 | 79.8 ±0.32 | 86.1 ±0.20 | 77.6 ±0.50 |
| ρ=1.75 | 64.6 ±1.43 | 80.5 ±0.50 | 86.4 ±0.26 | 77.8 ±0.29 |
| ρ=2 | 64.3 ±1.45 | 80.6 ±0.59 | 86.4 ±0.28 | 77.7 ±0.26 |
| ρ=3 | 62.0 ±1.24 | 81.1 ±0.73 | 86.8 ±0.38 | 77.4 ±0.38 |
| LTR | 69.3 | 72.0 | 80.8 | 74.3 |
| α‑LTR | | | | |
| ρ=0.1 | 72.7 ±1.25 | 68.8 ±0.47 | 80.0 ±0.12 | 73.9 ±0.25 |
| ρ=0.2 | 72.6 ±0.51 | 68.7 ±0.32 | 80.1 ±0.07 | 73.9 ±0.22 |
| ρ=0.3 | 72.3 ±0.98 | 69.2 ±0.39 | 80.1 ±0.06 | 74.0 ±0.33 |
| ρ=0.4 | 71.4 ±1.71 | 69.8 ±0.49 | 80.2 ±0.08 | 73.9 ±0.57 |
| ρ=0.5 | 71.6 ±0.88 | 70.0 ±0.46 | 80.2 ±0.11 | 74.1 ±0.29 |
| ρ=0.75 | 69.9 ±2.14 | 70.3 ±0.46 | 80.3 ±0.11 | 73.7 ±0.71 |
| ρ=1 | 68.5 ±1.79 | 70.4 ±0.53 | 80.5 ±0.17 | 73.4 ±0.56 |
| ρ=1.25 | 68.0 ±1.39 | 71.1 ±0.45 | 80.6 ±0.12 | 73.5 ±0.46 |
| ρ=1.5 | 68.1 ±2.10 | 71.1 ±0.47 | 80.6 ±0.19 | 73.5 ±0.71 |
| ρ=1.75 | 66.2 ±2.59 | 71.1 ±0.42 | 80.6 ±0.18 | 73.0 ±0.70 |
| ρ=2 | 67.4 ±1.57 | 71.2 ±0.47 | 80.7 ±0.11 | 73.4 ±0.38 |
| ρ=3 | 67.1 ±2.00 | 71.7 ±0.57 | 80.8 ±0.17 | 73.5 ±0.48 |
Table C8: Top‑1 accuracy on iNaturalist, using AlphaNet applied to cRT.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| cRT | 69.2 | 71.9 | 75.7 | 71.2 |
| α‑cRT | | | | |
| ρ=0.01 | 76.1 ±0.60 | 54.9 ±2.54 | 65.5 ±2.11 | 64.4 ±1.43 |
| ρ=0.02 | 74.4 ±0.77 | 59.1 ±1.56 | 68.6 ±1.27 | 66.2 ±1.05 |
| ρ=0.03 | 74.2 ±0.66 | 61.1 ±1.22 | 70.0 ±0.85 | 67.2 ±0.59 |
| ρ=0.04 | 73.7 ±0.69 | 61.6 ±1.61 | 70.4 ±0.94 | 67.3 ±0.86 |
| ρ=0.05 | 73.3 ±0.68 | 62.8 ±1.14 | 71.0 ±0.68 | 67.8 ±0.69 |
Table C9: Top‑5 accuracy on iNaturalist, using AlphaNet applied to cRT.

| Model | Few | Med. | Many | Overall |
|---|---|---|---|---|
| cRT | 87.7 | 88.1 | 89.8 | 88.1 |
| α‑cRT | | | | |
| ρ=0.01 | 90.3 ±0.33 | 82.3 ±1.33 | 86.3 ±0.79 | 85.9 ±0.65 |
| ρ=0.02 | 89.5 ±0.28 | 83.5 ±1.20 | 87.2 ±0.81 | 86.3 ±0.62 |
| ρ=0.03 | 89.2 ±0.27 | 84.4 ±0.55 | 87.9 ±0.32 | 86.7 ±0.28 |
| ρ=0.04 | 89.2 ±0.25 | 84.4 ±0.79 | 87.9 ±0.51 | 86.7 ±0.40 |
| ρ=0.05 | 88.7 ±0.33 | 84.7 ±0.67 | 88.1 ±0.41 | 86.6 ±0.32 |

D Change in per-class accuracies

In this section, we analyze the change in accuracy for individual classes after applying AlphaNet (following the training process described in Appendix A). First, we plotted the per-class accuracy changes, sorted and grouped by split; these are shown in Figures D1–D7.

We also plotted the accuracy change for each class against its mean Euclidean distance to its 5 nearest neighbor classes. For ‘few’ split classes, neighbors were selected from the ‘base’ split, and for ‘base’ split classes, neighbors were selected from the ‘few’ split. Recall that AlphaNet updates ‘few’ split classifiers using the classifiers of these nearest ‘base’ split neighbors. Results are shown in Figures D8–D10, and a brief code sketch of this analysis is given after the figure captions.

Figure D1: Change in per-class test accuracy on ImageNet‑LT after AlphaNet training with the cRT baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D2: Change in per-class test accuracy on ImageNet‑LT after AlphaNet training with the LWS baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D3: Change in per-class test accuracy on ImageNet‑LT after AlphaNet training with the RIDE baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D4: Change in per-class test accuracy on Places‑LT after AlphaNet training with the cRT baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D5: Change in per-class test accuracy on Places‑LT after AlphaNet training with the LWS baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D6: Change in per-class test accuracy on CIFAR‑100‑LT after AlphaNet training with the RIDE baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D7: Change in per-class test accuracy on CIFAR‑100‑LT after AlphaNet training with the LTR baseline. Each bar shows the change in accuracy for one class. The solid line in each split shows the average per-class change for that split, and the dotted line shows the overall average per-class change.
Figure D8: Change in per-class test accuracy on ImageNet‑LT, versus mean Euclidean distance to the 5 nearest neighbor classes. Neighbors are from the ‘base’ split for ‘few’ split classes, and vice versa for ‘base’ split classes. The lines are regression fits, and the $r$ values are Pearson correlations. (a) cRT baseline (b) LWS baseline (c) RIDE baseline
Figure D9: Change in per-class test accuracy on Places‑LT, versus mean Euclidean distance to the 5 nearest neighbor classes. Neighbors are from the ‘base’ split for ‘few’ split classes, and vice versa for ‘base’ split classes. The lines are regression fits, and the $r$ values are Pearson correlations. (a) cRT baseline (b) LWS baseline
Figure D10: Change in per-class test accuracy on CIFAR‑100‑LT, versus mean Euclidean distance to the 5 nearest neighbor classes. Neighbors are from the ‘base’ split for ‘few’ split classes, and vice versa for ‘base’ split classes. The lines are regression fits, and the $r$ values are Pearson correlations. (a) RIDE baseline (b) LTR baseline
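For reference, the following is a minimal sketch of the nearest-neighbor analysis behind Figures D8–D10, not the released implementation. The per-class representations and accuracy changes below are random placeholders with illustrative shapes and names; in the actual analysis they would come from the pre-trained model's representation space and from the test evaluations before and after AlphaNet.

```python
# Sketch of the analysis in Figures D8–D10: correlate per-class accuracy change
# with the mean Euclidean distance to the 5 nearest classes of the other split.
import numpy as np
from scipy.stats import pearsonr


def mean_knn_distance(query_feats, ref_feats, k=5):
    """Mean Euclidean distance from each query class to its k nearest reference classes."""
    # Pairwise distances between class representations: shape (n_query, n_ref).
    dists = np.linalg.norm(query_feats[:, None, :] - ref_feats[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)


rng = np.random.default_rng(0)
few_feats = rng.normal(size=(150, 128))   # placeholder: per-class representations, 'few' split
base_feats = rng.normal(size=(850, 128))  # placeholder: per-class representations, 'base' split
acc_change = rng.normal(size=150)         # placeholder: per-class accuracy change after AlphaNet

nn_dist = mean_knn_distance(few_feats, base_feats, k=5)
r, p = pearsonr(nn_dist, acc_change)      # Pearson correlation reported as $r$ in the captions
print(f"r = {r:.2f} (p = {p:.3g})")
```

The regression lines shown in the figures can be obtained from the same pairs, for example with `np.polyfit(nn_dist, acc_change, 1)`; the analysis for ‘base’ split classes is identical with the roles of the two splits swapped.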

  1. Results for the 6-expert model are presented in the GitHub repository for the original paper at github.com/frank-xwang/RIDE-LongTailRecognition/blob/main/MODEL_ZOO.md.

  2. The ‘base’ split is the complement of the ‘few’ split, composed of classes with many training samples.

  3. Low-shot learning is also referred to as few-shot learning, and as one-shot learning if only a single training example is available per class.

  4. Another popular dataset for evaluating long-tail models is iNaturalist32. However, models are able to achieve much more balanced results on this dataset, compared to other long-tailed datasets. For example, with the cRT model, ‘few’ split accuracy (69.2%) is only 2 points lower than the overall accuracy (71.2%). So the dataset does not represent a valid use case for our proposed method, and we omitted the dataset from our main experiments. Results for this dataset are included in the appendix (Appendix C).

  5. For example, suppose there are 2 ‘base’ classes – class 1 has 10 samples, and class 2 has 100 samples. Then, each class 1 sample is assigned a weight of 0.1, and each class 2 sample is assigned a weight of 0.01. Sampling with this weight distribution, both classes have a 50% chance of being sampled.

  6. catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch, version 22.06.

  7. github.com/jayanthkoushik/alphanet.

  8. github.com/facebookresearch/classifier-balancing.

  9. github.com/frank-xwang/RIDE-LongTailRecognition.

  10. github.com/ShadeAlsha/LTR-weight-balancing.

  11. This is only possible for ImageNet‑LT since image labels correspond to WordNet synsets.