In this chapter, we use our taxonomy to classify the different types of AL strategies. For each section and each type of strategy, we give a short description at the beginning, then provide more detail. At the end, we list the representative works in the category (each with a short note).
Note that we do not take batch mode as a dimension in our taxonomy. If you only care about how to apply batch selection, please check here. The classification problems here include binary and multi-class classification (although some works can only be applied to binary classification). Some works also focus specifically on multi-class classification settings; please check here.
- Pool-Based Active Learning for Classification
- Taxonomy
- Categories
In pool-based AL, the strategy is in fact a scoring function that judges how much information each instance contains for the current task. Previous works calculate their scores in different ways. We summarize them into the following categories.
Intuition | Description | Comments |
---|---|---|
Informativeness | Uncertainty by the model prediction | Usually refers to how much information instances would bring to the model. |
Representativeness-impart | Represent the underlying distribution | Normally used with informativeness. This type of methods may have overlaps with batch-mode selection. |
Expected Improvements | The improvement of the model's performance | The evaluations usually take more time. |
Learn to score | Learn an evaluation function directly. | |
Others | Could not be classified into the previous categories | |
Informativeness usually refers to how much information an instance would bring to the model. Thus, the evaluation usually depends on the currently trained model.
This is the most basic strategy of AL. It aims to select the instances which are most uncertain to the current model. There are basically three sub-strategies here.
- Classification uncertainty
- Select the instance close to the decision boundary.
- Classification margin
- Select the instance whose probabilities of being classified into the two most likely classes are closest.
- Classification entropy
- Select the instance that has the largest classification entropy over all the classes.
The equations and details can be found here.
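The three sub-strategies above can be sketched as scoring functions over predicted class probabilities. This is a minimal illustration, not code from any of the listed works; the function names and the toy `pool_probs` values are assumptions made for the example.

```python
import math

def least_confidence(probs):
    """1 - max class probability; higher means more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def entropy(probs):
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, score="entropy"):
    """Return the index of the pool instance to query under the chosen score."""
    indices = range(len(pool_probs))
    if score == "margin":
        # For the margin criterion, the smallest gap is the most uncertain.
        return min(indices, key=lambda i: margin(pool_probs[i]))
    fn = entropy if score == "entropy" else least_confidence
    return max(indices, key=lambda i: fn(pool_probs[i]))

# Hypothetical model outputs for three unlabeled instances (3 classes).
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.50, 0.49, 0.01],
    [0.40, 0.30, 0.30],
]
print(select_most_uncertain(pool_probs, "entropy"))
print(select_most_uncertain(pool_probs, "margin"))
```

On this toy pool the entropy and margin criteria pick different instances, which shows that the three scores are not interchangeable in the multi-class case.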
Works:
- Heterogeneous uncertainty sampling for supervised learning [1994, ICML]: Most basic Uncertainty strategy. Could be used with probabilistic classifiers. (1071 citations)
- Support Vector Machine Active Learning with Applications to Text Classification [2001, JMLR]: Version space reduction with SVM. (2643 citations)
- How to measure uncertainty in uncertainty sampling for active learning [2021, Machine Learning]
This type of method needs a group of models, and the sampling strategy is based on their outputs. The group of models is called a committee, so this type of work is also named Query-By-Committee (QBC). The intuition is that if the committee members disagree on the label of an unlabeled instance, that instance should be informative at the current stage.
- Disagreement measurement
- Vote entropy
- Consensus entropy
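As a rough sketch of the two disagreement measures listed above: vote entropy works on the hard labels voted by the committee members, while consensus entropy works on their averaged class probabilities. The function names and inputs below are assumptions for illustration only.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the hard label votes cast by the committee members."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def consensus_entropy(member_probs):
    """Entropy of the class distribution averaged over committee members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    consensus = [
        sum(p[k] for p in member_probs) / n_members for k in range(n_classes)
    ]
    return -sum(p * math.log(p) for p in consensus if p > 0)

# A committee that splits evenly has maximal vote entropy for two classes.
print(vote_entropy(["a", "b", "a", "b"]))
# Two confident members that contradict each other give a flat consensus.
print(consensus_entropy([[0.9, 0.1], [0.1, 0.9]]))
```

In both cases, the instance with the highest score over the pool would be the one to query.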
Works:
- Query by committee [1992, COLT]: QBC. The idea is to build multiple classifiers on the current labeled data by using the Gibbs algorithm. The selection is based on the disagreement of the classifiers. (1466 citations)
- Query learning strategies using boosting and bagging [1998, ICML]: Avoid Gibbs algorithm in QBC. Ensemble learning with diversity query. (433 citations)
- Diverse ensembles for active learning [ICML, 2004]: In previous QBC methods it is hard to make the classifiers very different from each other. This method uses DECORATE to build the classifiers, with C4.5 as the base learner. Outperform 4. Select the instance with the highest JS divergence. (339 citations)(Delete)
- Bayesian active learning for classification and preference learning [2011, Arxiv]: Bayesian Active Learning by Disagreement (BALD). Seek the x for which the parameters under the posterior (output by using the parameters) disagree about the outcome (output by using the labeled dataset) the most. (149 citations)
- The power of ensembles for active learning in image classification [2018, CVPR]
- Consistency-Based Semi-supervised Active Learning: Towards Minimizing Labeling Cost [2021, Springer]: A semi-supervised AL method.
If labeling a selected instance would bring the largest change to the model, it can be considered the most informative instance for the current task.
Works:
- An analysis of active learning strategies for sequence labeling tasks [2008, EMNLP]: Expected gradient change in gradient-based optimization. (659 citations)
- Influence Selection for Active Learning [2021]: Select instances with the most positive influence on the model. The influence is divided into an expected gradient length and another term.
- Model-Change Active Learning in Graph-Based Semi-Supervised Learning [2021]
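To make the expected model change idea concrete, here is a minimal sketch of an expected gradient length score for binary logistic regression. The choice of model is an assumption made only to keep the gradient simple: for the log loss, the gradient for a candidate `(x, y)` is `(p - y) * x`, so its norm is `|p - y| * ||x||`, and the expectation is taken over the model's own label distribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_gradient_length(w, x):
    """Expected norm of the log-loss gradient if x were labeled and added."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # P(y = 1 | x)
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    egl = 0.0
    for y, p_y in ((1, p), (0, 1.0 - p)):
        grad_norm = abs(p - y) * x_norm  # norm of (p - y) * x
        egl += p_y * grad_norm           # weight by the predicted label probability
    return egl

w = [1.0, -1.0]  # hypothetical current weights
print(expected_gradient_length(w, [2.0, 2.0]))  # on the decision boundary
print(expected_gradient_length(w, [3.0, 0.0]))  # confidently classified
```

The instance on the decision boundary gets the larger score here, which matches the intuition that it would change the model the most.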
The informativeness of instances could be defined in many other ways.
Works:
- Optimizing Active Learning for Low Annotation Budgets [2021]: Select the samples with the maximum shift from certainty to uncertainty.
- Active Learning by Acquiring Contrastive Examples [2021, EMNLP]: CAL. Take the inconsistency of predictions with the neighbors as the selection criterion. The belief is that data points that are similar in the model feature space, and yet for which the model outputs maximally different predictive likelihoods, should be queried.
- On The Effectiveness of Active Learning by Uncertainty Sampling in Classification of High-Dimensional Gaussian Mixture Data [2022, PerCom Workshops]: Add area-under-margin as an informativeness measurement.
- ALLSH: Active Learning Guided by Local Sensitivity and Hardness [2022]: Select the instances whose predictive likelihoods diverge the most from their perturbations.
- Active Learning by Feature Mixing [2022, CVPR]: An instance whose representation could maximally influence the output of an anchor labeled instance (by feature mixing) is considered informative.
- Gaussian Switch Sampling: A Second Order Approach to Active Learning [2023, TAI]: The forgettable data (classified correctly at time t and subsequently misclassified at a later time) should be informative.
- Bayesian Estimate of Mean Proper Scores for Diversity-Enhanced Active Learning [2023, TPAMI]
The previously introduced works seldom consider the data distribution. Those strategies focus on the decision boundary, and the representativeness of the data is neglected. Therefore, many works take the representativeness of the data into account. Basically, it measures how well the labeled instances are aligned with the unlabeled instances in distribution. We note that few works consider only the representativeness of the data. More commonly, representativeness and informativeness are considered together to sample instances.
The simplest idea is to use cluster structure to guide the selection. Clustering can be applied either to the original features or to the learned embeddings.
- Cluster-based sampling:
- Pre-cluster
- Hierarchical sampling
- Cluster on other types of embedding
Works:
- Active learning using pre-clustering [2004, ICML]: (483 citations)
- Hierarchical Sampling for Active Learning [2008, ICML]: Take into account both the uncertainty and the representativeness. The performance heavily depends on the quality of clustering results. (388 citations)
- Ask-n-Learn: Active Learning via Reliable Gradient Representations for Image Classification [2020]: Use kmeans++ on the learned gradient embeddings to select instances.
- D-CALM: A Dynamic Clustering-based Active Learning Approach for Mitigating Bias [2023]
These types of strategies take into account the distribution and local density. The intuition is that regions with higher density are more likely to be queried, i.e., the selected instances and the unlabeled instances should have similar distributions.
- Density-based sampling:
- Information density
- RALF
- k-Center-Greedy (Core-set): Only consider the representativeness.
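The k-Center-Greedy selection mentioned above can be sketched as follows. This is a simplified illustration that assumes Euclidean distance on raw feature vectors and at least one already labeled point; the function name and the toy points are assumptions, not the implementation from the core-set paper.

```python
import math

def k_center_greedy(points, labeled_idx, budget):
    """Greedily pick `budget` indices, each time the point farthest from its nearest center."""
    # Distance from every point to its nearest labeled point (the initial centers).
    min_dist = [min(math.dist(p, points[j]) for j in labeled_idx) for p in points]
    selected = []
    for _ in range(budget):
        i = max(range(len(points)), key=lambda k: min_dist[k])
        selected.append(i)
        # The new center may now be the nearest one for some points.
        min_dist = [min(d, math.dist(p, points[i])) for d, p in zip(min_dist, points)]
    return selected

points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(k_center_greedy(points, labeled_idx=[0], budget=2))
```

Note that the selection uses only distances between instances, which is consistent with the remark that this strategy considers representativeness alone and ignores the model's predictions.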
Works:
- An analysis of active learning strategies for sequence labeling tasks [2008, EMNLP]: Information density framework. The main idea is that informative instances should not only be those which are uncertain, but also those which are “representative” of the underlying distribution (i.e., inhabit dense regions of the input space). (659 citations)
- RALF: A Reinforced Active Learning Formulation for Object Class Recognition [2012, CVPR]: RALF. A time-varying combination of exploration and exploitation sampling criteria. Include graph density in the exploitation strategies. (59 citations)
- Active learning for convolutional neural networks: A core-set approach [ICLR, 2018]: The core-set loss is simply the difference between the average empirical loss over the set of labeled points and the average empirical loss over the entire dataset, including unlabeled points. Optimizing the upper bound of the core-set loss can be treated as a k-center problem in practice. It does not need the output of the current model.
- Minimax Active Learning [2020]: Develop a semi-supervised minimax entropy-based active learning algorithm that leverages both uncertainty and diversity in an adversarial manner.
- Multiple-criteria Based Active Learning with Fixed-size Determinantal Point Processes [2021]
- In Defense of Core-set: A Density-aware Core-set Selection for Active Learning [2022, KDD]: Observe that locally sparse regions tend to have more informative samples than dense regions. The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions.
This type of work directly takes into account a measurement of the distribution alignment between the labeled/selected data and the unlabeled data, i.e., the labeled and the unlabeled instances should be hard to distinguish. There are adversarial works and non-adversarial works.
Types:
- Adversarial based
- Non-adversarial based
Works:
- Exploring Representativeness and Informativeness for Active Learning [2017, IEEE TRANSACTIONS ON CYBERNETICS]: Optimization based. The representativeness is measured by fully investigating the triple similarities that include the similarities between a query sample and the unlabeled set, between a query sample and the labeled set, and between any two candidate query samples. For representativeness, the goal is also to find the sample that makes the distribution discrepancy between unlabeled data and labeled data small. For informativeness, it uses BvSB. (85 citations)
- Discriminative Active Learning [2019, Arxiv]: Make the labeled and unlabeled pool indistinguishable.
- Agreement-Discrepancy-Selection: Active Learning with Progressive Distribution Alignment [2021]
- Dual Adversarial Network for Deep Active Learning [2021, ECCV]: DAAL.
- Multi-Classifier Adversarial Optimization for Active Learning [2023, AAAI]
Many works only score an instance by the expected performance on the labeled data and the selected data. Some other works also take into account the expected loss on the remaining unlabeled data as a measurement of representativeness.
- Expected loss on unlabeled data:
- QUIRE
- ALDR+
Works:
- Active Learning by Querying Informative and Representative Examples [2010, NeurIPS]: QUIRE. Optimization based. It considers not only the loss on the labeled data (uncertainty) but also the loss on the unlabeled data (representativeness; correct labels would lead to a small value of the overall evaluation function). This method is very computationally expensive. (370 citations)
- Efficient Active Learning by Querying Discriminative and Representative Samples and Fully Exploiting Unlabeled Data [2020, TNNLS]: ALDR+. This paper also provides a new taxonomy for AL in classification, which includes three parts: criteria for querying samples, exploiting unlabeled data, and acceleration. The proposed method takes all three parts into account.
Pre-divide the pool into batches in a certain way, then select from each batch. Besides pre-clustering, there are other criteria for preparing the batches:
- Divide by loss on auxiliary tasks (self-supervised tasks).
- Divide by certain distance.
Works:
- Using Self-Supervised Pretext Tasks for Active Learning [2022]
- BAL: Balancing Diversity and Novelty for Active Learning [2024, TPAMI]
Our learning purpose is to reduce the generalization error at the end (in other words, to have better final performance). From this perspective, we can select the instances that would improve the performance at each selection stage. Because we don't know the true label of the instance we are going to select, the expected performance is normally calculated for each instance. These methods normally need to retrain the model for each unlabeled instance in the pool, so they can be very time-consuming.
- Expected improvement
- Error Reduction: Most directly, reduce the generalization error.
- Variance Reduction: We can still reduce generalization error indirectly by minimizing output variance.
- Entropy Change: The reduction of prediction entropy on the evaluation set after adding a new item.
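The error reduction loop can be sketched as below: for every candidate and every possible label, retrain, estimate the error on the rest of the pool, and weight by the current model's label probability. The 1-D nearest-centroid model with softmax probabilities is a toy assumption chosen so the example is self-contained; all names are hypothetical, and the nested loops show why this family of methods is expensive.

```python
import math

def fit_centroids(X, y):
    """Toy model (an assumption): one mean per class on 1-D features."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        sums[yi] = sums.get(yi, 0.0) + xi
        counts[yi] = counts.get(yi, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict_proba(centroids, x):
    """Softmax over negative distances to the class centroids."""
    weights = {c: math.exp(-abs(x - m)) for c, m in centroids.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def expected_error(centroids, pool):
    """Estimated 0/1 error on the pool: sum of (1 - max posterior)."""
    return sum(1.0 - max(predict_proba(centroids, x).values()) for x in pool)

def expected_error_reduction(X, y, pool):
    """Return the pool index whose labeling minimizes the expected future error."""
    current = fit_centroids(X, y)
    best_idx, best_risk = None, float("inf")
    for i, x in enumerate(pool):
        rest = pool[:i] + pool[i + 1:]
        risk = 0.0
        # The true label is unknown, so average over the current predictions.
        for label, p_label in predict_proba(current, x).items():
            retrained = fit_centroids(X + [x], y + [label])  # one retrain per (x, label)
            risk += p_label * expected_error(retrained, rest)
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return best_idx

print(expected_error_reduction([0.0, 4.0], [0, 1], [1.0, 2.0, 3.5]))
```

With a pool of size n and k classes, this sketch performs n * k retraining steps per query, each followed by a pass over the remaining pool.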
Works:
- Toward optimal active learning through sampling estimation of error reduction [2001, ICML]: Error Reduction
- Combining active learning and semisupervised learning using Gaussian fields and harmonic functions [2003, ICML]: Greedily select queries from the unlabeled data to minimize the estimated expected classification error. GRF as basic model. This is also the most computationally expensive query framework since it needs to iterate over the whole unlabeled pool. (517 citations)
- Active learning for logistic regression: an evaluation [2007, Machine Learning]: Variance Reduction
- Uncertainty-aware active learning for optimal Bayesian classifier [2021, ICLR]: ELR tends to get stuck in local optima; BALD tends to be overly explorative. Proposes an acquisition function based on a weighted form of mean objective cost of uncertainty (MOCU).
- Information Gain Sampling for Active Learning in Medical Image Classification [2022]: Expected information gain (EIG) measures the reduction of entropy.
All the sampling strategies mentioned above are based on heuristics. Their intuitions are clear, but they might perform differently on different datasets. So some researchers proposed learning a sampling strategy from the sampling process.
- Learn to score
- Learn a strategy selection method: select from heuristics
- Learn a score function: learn a function that scores instances directly
- Learn an AL policy (as an MDP)
Works:
- Active learning by learning [2015, AAAI]: ALBL. A single human-designed philosophy is unlikely to work in all scenarios. Given an appropriate choice of multi-armed bandit learner, it takes the importance-weighted accuracy (an unbiased estimator of the test accuracy) as the reward function, which makes it possible to estimate the performance of different strategies on the fly. SVM as the underlying classifier. (41 citations)
- Learning active learning from data [2017, NeurIPS]: LAL. Train a random forest regressor that predicts the expected error reduction for a candidate sample in a particular learning state. Previous works cannot go beyond combining pre-existing hand-designed heuristics. Random forest as the base classifier. (It is not clear how to get the test classification loss l; it is explained in neither the paper nor the code.) (73 citations)
- Learning how to Active Learn: A Deep Reinforcement Learning Approach [2017, Arxiv]: PAL. Use RL to learn how to select instances. Even though the strategy is learned and applied in a stream manner, the stream is built from the data pool, so in our view it can be considered a pool-based method. (92 citations)
- Learning How to Actively Learn: A Deep Imitation Learning Approach [2018, ACL]: Learn an AL policy using imitation learning, mapping situations to most informative query datapoints. (8 citations)
- Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning [2018, Arxiv]
- Learning Loss for Active Learning [2019, CVPR]: Attach a small parametric module, named “loss prediction module,” to a target network, and learn it to predict target losses of unlabeled inputs.
- Learning to Rank for Active Learning: A Listwise Approach [2020]: Have an additional loss prediction model to predict the loss of instances beside the classification model. Then the loss is calculated by the ranking instead of the ground truth loss of the classifier.
- Deep Reinforcement Active Learning for Medical Image Classification [2020, MICCAI]: Take the prediction probability of the whole unlabeled set as the state. The action is to produce a ranking of the unlabeled set with an actor network. The reward is the difference between the predicted value and the true label of the selected instances. Adopt a critic network with parameters θ c to approximate the Q-value function.
- ImitAL: Learning Active Learning Strategies from Synthetic Data [2021]: An imitation learning approach.
- Cartography Active Learning [2021, EMNLP]: CAL. Select the instances that are the closest to the decision boundary between ambiguous and hard-to-learn instances.
- Deep reinforced active learning for multi-class image classification [2022]
- ImitAL: Learned Active Learning Strategy on Synthetic Data [2022]
- Algorithm Selection for Deep Active Learning with Imbalanced Datasets [2023]
- Reinforced Active Learning for Low-Resource, Domain-Specific, Multi-Label Text Classification [2023, ACL]
- BatchGFN: Generative Flow Networks for Batch Active Learning [2023, ICML workshop]
- Learning Objective-Specific Active Learning Strategies with Attentive Neural Processes [2023, ECML PKDD]
There are still other works that use innovative heuristics. These works are a little hard to classify for now, so we put them in this section. They might be classified later.
Self-paced:
- Self-Paced Active Learning: Query the Right Thing at the Right Time [AAAI 2019]: Borrows the idea from self-paced learning that learning from easy instances to hard instances improves performance.
- Combination of Active Learning and Self-Paced Learning for Deep Answer Selection with Bayesian Neural Network [2020, ECAI]: Just a combination. It does not apply the idea of self-paced learning to AL.
- Self-paced active learning for deep CNNs via effective loss function [2021, Neurocomputing]: The selection criterion selects instances with more discrepancy at the beginning and instances with more uncertainty later. They also have a similarity classification loss function in their model to ensure the effectiveness of the representations.
Utilize historical evaluation results:
- Looking Back on the Past: Active Learning with Historical Evaluation Results [TKDE, 2020]: Model the AL process as a ranking problem and use the learned ranking results to select instances.
Hybrid:
- HAL: Hybrid active learning for efficient labeling in medical domain [2021, Neurocomputing]
- How to Select Which Active Learning Strategy is Best Suited for Your Specific Problem and Budget [2023]: SelectAL. Combine representativeness and uncertainty under different budgets.
- A More Robust Baseline for Active Learning by Injecting Randomness to Uncertainty Sampling [2023, ICML workshop]