Multi-label Classification: A Guided Tour

Introduction

I recently undertook some work on tagging academic papers with one or more labels, given a labeled training set.

A preliminary look through the data revealed about 8000 examples, 2750 features, and…650 labels. For clarification, that’s 2750 sparse binary features (keyword indices for the articles), and 650 labels, not classes. Label cardinality (average number of labels per example) is about 2, with the majority of labels only occurring a few times in the dataset…doesn’t look good, does it? Nevertheless, more data wasn’t available and label reduction wasn’t on the table yet, so I spent a good amount of time in the corners of academia looking at multi-label work. 

What is multi-label classification? While multiclass classification assigns exactly one class to each example, multi-label classification assigns a set of labels (possibly empty) to each example. For instance, in my case each article could be tagged with anywhere from 0 to 650 labels. Here’s a description of some sample datasets frequently cited in the literature:

[Table: sample multi-label datasets frequently cited in the literature]

I was first surprised to re-learn that multi-label classification, as a topic of its own or even as an extension of multiclass and binary classification, is relatively underrepresented in standard texts and curricula. For example, Sebastian Raschka’s excellent Python Machine Learning, which covers enough motivation, math, and code for a strong survey of machine learning methods, makes no mention of “multilabel” or “multi-label” (the conventional spelling isn’t clear to me). Nor do better-known texts like Programming Collective Intelligence or, even more surprisingly, Bishop’s book, so far as I can see. In most cases, the survey of multi-label learning is diffused throughout academic articles, which not everyone has the time or inclination to read.

Even more problematic: chances are, your favorite ML framework has limited mention of, or capacity for, handling multi-label problems other than as modified multiclass problems. For example, scikit-learn’s documentation is a little thin and subtle when it comes to implementing and adapting the library for multi-label problems. Fortunately, a variety of projects extending and supplementing the capabilities of the big ML frameworks with multi-label techniques have sprung up as a result (MEKA, scikit-multilearn, Mulan, etc.).

It might be assumed that multi-label is more of an applied extension of single-label classification than anything else, and indeed I believe the relative lack of literature on the subject comes down to the fact that in many cases you can get away with transforming a multi-label problem into a multiclass problem (most often, a set of one-vs-the-rest classifiers) and be done with it.

Yet multi-label problems do not always resolve themselves into one-vs-the-rest problems as neatly as multiclass problems do (particularly if you have a large, sparse set of labels), so I thought it would be a good idea to share what I’ve learned from looking at multi-label methods and solutions off the beaten track.

Hopefully, this guide will help you more quickly navigate your multi-label problems, and may even provide you with some new general purpose tools for your machine learning toolbox.

Dependence

Here’s a spoiler: the field of multi-label classification is all about dependence between labels. Most of what’s written is either about transforming these dependencies in the data to fit well-known algorithms, or it’s about new algorithms that take advantage of these dependencies to improve performance.

The presence of a given label for a given training example in many cases provides information about the likelihood of other labels applied to that example. For instance, if we are trying to label objects within an image and determine that a boat exists in the center of the image, this tells us something about the likelihood of other labels (sail, buoy, seagull) being assigned to the objects in this image. By interpreting the multi-label problem as a discrete multiclass or one-vs-the-rest problem and using binary relevance methods, we assume independence and lose valuable information that could be exploited to improve our classifier.


It’s useful to parse the notion of dependence a little further. Let’s look at [Dembczynski et al. 2010]’s definitions of conditional and unconditional dependence.

Given the set of q labels Y, the labels are unconditionally dependent if the joint is not the product of the marginals:

p(Y) \neq \prod_{i=1}^{q} p^{(i)} (Y_i)

This is basically saying that independence is violated somewhere among the labels, P(A|B) \neq P(A) (refer to the probability chain rule): the probability that label A applies to an example changes based on whether label B has been assigned to that example, without taking the example’s features into account. Unconditional dependence is readily measured with mutual information:

I(Y_j; Y_k) = \sum\limits_{y_j \in \{0,1\}} \sum\limits_{y_k \in \{0,1\}} p(y_j, y_k) \log \left( \frac {p(y_j, y_k)} {p(y_j)\,p(y_k)} \right)
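
As a quick sanity check on your own data, you can estimate pairwise mutual information directly from the label indicator matrix. A minimal sketch, assuming Y is a NumPy array of shape (n_samples, q); the toy values and variable names are mine:

import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical binary label indicator matrix Y of shape (n_samples, q)
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]])

q = Y.shape[1]
mi = np.zeros((q, q))
for j in range(q):
    for k in range(j + 1, q):
        # Mutual information (in nats) between two label columns;
        # values well above zero suggest unconditional dependence.
        mi[j, k] = mi[k, j] = mutual_info_score(Y[:, j], Y[:, k])

print(np.round(mi, 3))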

On the other hand, conditional dependence refers to dependence between labels conditioned on an event x:

p_x(Y) \neq \prod_{i=1}^{q} p_x^{(i)} (Y_i)

where p_x(Y) = p(Y|x)

Here we’re saying that the dependence between labels only shows up once we condition on the particular example we’re looking at: given the features x of that example, knowing that label B applies still changes the probability that label A applies, even though the labels may look independent when we average over all examples.

Unconditional dependence doesn’t imply conditional dependence, as you might expect. It’s not quite as obvious that conditional dependence doesn’t imply unconditional dependence, but the authors construct a simple example that demonstrates this nicely. In this case, we’re looking at a joint distribution with binary feature x and two binary labels y_1, y_2:

[Tables from Dembczynski et al. 2010: joint distributions over the binary feature x and binary labels y_1, y_2 exhibiting conditional dependence without unconditional dependence]

The unconditional / conditional distinction lets us formalize dependency and more clearly understand which algorithms address which kind of dependency and how they do it.

A very closely related topic worth mentioning falls in the realm of collective classification, though I won’t go into details here. While the unconditional / conditional distinction lets us model label dependencies, what about the relationship between feature dependencies and label dependencies? For example, if we are labeling academic articles given their authors and titles, how might we model the increased probability that Author A’s article should be assigned the label “machine learning” given that her frequent collaborators, Author B and Author C, have been assigned the “data mining” label? [Kong, Shi, Yu, 2010] call this intra-instance cross-label dependency ((a) and (b) below just correspond to unconditional and conditional dependence):

[Figure from Kong, Shi, and Yu 2010 illustrating the types of label dependency, including intra-instance cross-label dependency]

Anyway, if that’s interesting to you, the collective classification literature focuses on exploiting the relational dependencies between instances and features instead of looking only at label dependencies. Of course, this only applies if your data exhibits relational structure like, say, a graph.

In conclusion, multi-label classification is all about dependence, and a successful multi-label approach is one that exploits information about label dependencies.

Approaches

The two broad approaches to multi-label classification are data transformation and algorithm adaptation. Simply put: transform the data to fit multiclass algorithms, or modify the algorithms to fit multi-label data. Most strategies mix both approaches.

  • Data Transformation

Data transformation is the more obvious option if you’re unfamiliar with multi-label problems. It involves changing the target variables into multiclass form. For example, given a problem with q different labels:

    • Binary Relevance (BR), or one-vs-the-rest: create q binary classifiers
    • Label Powerset: create a multiclass problem with up to 2^q classes, one per distinct labelset
    • Pairwise Binary, or all-vs-all: \frac {q(q-1)}{2} binary classifiers

As mentioned before, BR is the same one-vs-the-rest reduction commonly used for multiclass problems and is one of the more common performance baselines. BR assumes label independence, so there are a number of ideas for modifying it to account for label dependence. Classifier chains [Read et al. 2009], for example, feed the prediction outputs of earlier binary classifiers in as extra features for the later ones.
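
Both are easy to prototype in scikit-learn. A minimal sketch on synthetic data (the LogisticRegression base learner and all hyperparameters are arbitrary choices of mine):

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Toy data: X is (n_samples, n_features), Y is a binary label indicator matrix
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=5, random_state=0)

# Binary relevance: one independent binary classifier per label
br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Classifier chain: each classifier also sees the previous labels as features,
# which lets it exploit label dependence
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order='random', random_state=0).fit(X, Y)

print(br.predict(X[:3]))
print(chain.predict(X[:3]))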

The Label Powerset method has disadvantages of its own, typically a large number of class values distributed sparsely across the dataset. And since only the labelsets observed in training can ever be predicted, how do we generalize to new combinations? Ensemble methods like RAkEL work in random-forest fashion by training label powerset classifiers on small random subsets of the labels and aggregating their predictions.
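
The transformation itself is a one-liner if you already have a binary indicator matrix. A sketch of the idea (the toy Y and variable names are mine):

import numpy as np

# Hypothetical binary label matrix Y of shape (n_samples, q)
Y = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])

# Label powerset: map each distinct label combination to a single class id,
# turning the problem into an ordinary multiclass problem
labelsets, y_powerset = np.unique(Y, axis=0, return_inverse=True)
print(y_powerset)      # e.g. [1 1 0 2]: one class per observed labelset
print(len(labelsets))  # number of observed labelsets (at most 2**q)

# Train any multiclass classifier on (X, y_powerset); map predictions back
# to label vectors with labelsets[predicted_class]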

Additional ideas focus on decreasing the dependence between labels and then applying multiclass algorithms. In some cases dimensionality reduction can be effective (unmodified PCA, unmodified LDA) in reducing the dependence between label classes.

Hierarchy strategies take advantage of the existing hierarchical structure of labels – or create hierarchies based on the label features – in order to construct subproblems with reduced label dependence and computational complexity.

Hierarchy of Multilabel Classifiers (HOMER) [Tsoumakas et al. 2008] uses a balanced variant of k-means to create clusters of labels (each a subset of the total labels, with a balanced distribution of examples) grouped by label similarity, thereby making each cluster homogeneous and increasing the independence between higher-level clusters. The clusters are structured into a hierarchy, and a classifier is trained at each node of the hierarchy.

[Figure: the HOMER label hierarchy from Tsoumakas et al. 2008]
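
To get a feel for the grouping step without pulling in Mulan, here is a flat, simplified sketch: cluster the label columns by their co-occurrence patterns with ordinary k-means (not HOMER’s balanced variant, and with no recursion into a hierarchy), then train a binary relevance model per group. Everything here, from the group count to the base learner, is an arbitrary choice of mine:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=12, random_state=0)

# Cluster the label columns by their co-occurrence patterns across examples
n_groups = 3
groups = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(Y.T)

# One binary relevance model per group of related labels
models = {}
for g in range(n_groups):
    label_idx = np.where(groups == g)[0]
    models[g] = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y[:, label_idx])
    print(f"group {g}: labels {label_idx.tolist()}")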

  • Algorithm Adaptation

Algorithm adaptation involves modifying classifiers to make them multi-label capable. For example:

    • ML-kNN: kNN adapted to assign a new example the most common labels of its k nearest neighbors (combined with Bayesian inference over prior label frequencies) instead of the single majority class [Zhang and Zhou, 2007]
    • Decision trees with multi-label entropy and multiple labels per leaf, an extension of the C4.5 algorithm to multi-label data [Clare and King, 2001]
    • Random forests using predictive clustering trees [Kocev et al. 2013]
    • Neural networks with multiple outputs and a cost function based on the ranking of label relevance [Zhang and Zhou, 2006]
    • AdaBoost.MH (minimizing Hamming loss) and AdaBoost.MR (minimizing ranking error) [Schapire and Singer, 1999]

While traditional decision trees are adapted to multi-label data by summing binary entropies over the individual labels…

Entropy_{ML} = - \sum\limits_{i=1}^{q} [P(y_i) \log(P(y_i)) + (1-P(y_i)) \log(1-P(y_i))]

Predictive clustering trees (PCTs) view a decision tree as a hierarchy of clusters: the tree is built by choosing splits that maximize the reduction in intra-cluster variance (for multi-label data, a function of the labels’ Gini indices), making each cluster as homogeneous as possible, and each leaf is labeled with its cluster’s prediction. Random forests of PCTs are among the best-performing multi-label classification algorithms, as will be discussed later.
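
For concreteness, the multi-label entropy above is just a sum of per-label binary entropies, which is easy to compute from an indicator matrix. A minimal sketch (log base 2 is my choice; the formula doesn’t specify a base, and the toy Y is made up):

import numpy as np

def multilabel_entropy(Y, eps=1e-12):
    # Sum of binary entropies, one per label column
    p = Y.mean(axis=0)            # P(y_i = 1) for each label
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(p * np.log2(p) + (1 - p) * np.log2(1 - p))

Y = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [1, 0, 1]])
print(multilabel_entropy(Y))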

Partitioning the Data

While train/test holdout methods are fairly straightforward, cross-validation is somewhat tricky. Assuming you want partitions with approximately the same distribution as the original dataset – in other words, stratified cross-validation – off-the-shelf methods are limited. Note that in a multi-label context it’s not even clear what stratification means. Should we make sure the observed labelsets (labelset \in 2^q) are well distributed between partitions? Or should we be more concerned with the distribution of each individual label – and if so, given that labels arrive bound together in labelsets, how do we manage this? [Sechidis et al. 2011] propose an iterative scheme: compute the prevalence of each label across the dataset, then distribute the examples of each label as evenly as possible across the CV folds, beginning with the rarest label and ending with the most popular. The insight is that rare labels have the strictest distribution constraints because there are so few of them, so we should place them first; the distribution of popular labels can be repaired later thanks to their abundance, which wouldn’t work if we went the other way around.

The paper also provides some metrics for understanding whether your cross-validation split does a good job of representing the dataset and reducing performance variance between folds. For example, Labels Distribution compares, for each of the q labels, the ratio of positive to negative examples in each fold F_j (of k folds) against the same ratio in the whole dataset D:

Labels Distribution = \frac {1}{q} \sum\limits_{i=1}^q \left( \frac {1}{k} \sum\limits_{j=1}^k \left| \frac {|F_j^i|} {|F_j| - |F_j^i|} - \frac {|D^i|} {|D| - |D^i|} \right| \right)

The authors benchmark the iterative method against a labelset (powerset) stratification method and plain random splitting, and conclude that random splitting for multi-label data (which, unless you have a special multi-label CV function, is probably what you are using) performs worse than the powerset and iterative methods. I haven’t implemented the stratification methods, but I found the label distribution metrics, coupled with classifier performance across folds, useful for a deeper understanding of variance in label distribution. Looking at distribution, cardinality, and associated classifier performance for a given label is a powerful combination for interpreting results.
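
The Labels Distribution measure itself is straightforward to compute once you have your folds. A sketch of my reading of the formula above (Y is a binary indicator matrix and folds is a list of row-index arrays; both are stand-ins):

import numpy as np

def labels_distribution(Y, folds):
    # For each label, how far the positive/negative ratio in each fold deviates
    # from the ratio in the full dataset, averaged over folds and labels
    n, q = Y.shape
    d_pos = Y.sum(axis=0)                # |D^i|
    d_ratio = d_pos / (n - d_pos)        # |D^i| / (|D| - |D^i|)
    deviation = np.zeros(q)
    for fold_idx in folds:
        f_pos = Y[fold_idx].sum(axis=0)  # |F_j^i|
        f_ratio = f_pos / (len(fold_idx) - f_pos)
        deviation += np.abs(f_ratio - d_ratio)
    return np.mean(deviation / len(folds))

# Example with a random 3-fold split of a toy label matrix
rng = np.random.default_rng(0)
Y = (rng.random((90, 5)) < 0.2).astype(int)
folds = np.array_split(rng.permutation(90), 3)
print(labels_distribution(Y, folds))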

Evaluation Metrics

Precision, recall, F-measure, ROC…a lot of the evaluation metrics you’re familiar with from multiclass classification don’t readily translate to multi-label because they fail to capture the case of a predicted label set being partially correct. Take, for example, an accuracy score (one minus the 0/1 loss) in the multiclass context:

Accuracy_{MC} = \frac{1}{n} \sum\limits_{i=1}^{n} {I}(\hat{y}_i = y_i)

(where {I} is the indicator function returning 1 if true, 0 otherwise, and n is the number of examples)

is straightforward for multiclass but underspecified for multi-label. If we regard the whole label set for a given example as a single instance for the indicator function (did we identify all and only the correct labels?), the metric is probably too strict, as you will find if you implement it with sklearn. We would, for example, receive an accuracy score of 0 both in the case where we predict no labels correctly for each sample and in the case where we predict 9 out of 10 labels correctly for each sample. In the latter case, we would want the evaluation metric to reflect our partial correctness. In the multi-label setting, this strict accuracy metric is called exact match (or subset accuracy).
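
Here’s that 9-out-of-10 case with sklearn, whose accuracy_score computes exact match (subset accuracy) when given multilabel indicator input:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
y_pred = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])  # 9 of 10 labels correct

# Subset accuracy gives no credit for partial correctness
print(accuracy_score(y_true, y_pred))  # 0.0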

To capture the notion of partial correctness we can use metrics from two categories: example-based and label-based. In example-based metrics, the agreement between predicted and actual label sets is evaluated for each example and then averaged over all examples in the test set. In label-based metrics, each label is evaluated first (across all examples in which it appears) and the results are then averaged over all labels.

[Figure: taxonomy of multi-label evaluation metrics, example-based vs. label-based]

Hamming Loss (from Hamming distance) is the fraction of labels that are incorrectly predicted, and is perhaps the most widely used metric in multi-label learning. For a single example with q labels (the result is then averaged over all examples):

Hamming Loss = \frac {1} {q} \sum\limits_{i=1}^{q} {I}(\hat{y}_i \neq y_i)

In the context of multi-label learning, accuracy is defined as the size of the intersection of the predicted and actual label sets divided by the size of their union (the Jaccard index), averaged over the n examples:

Accuracy_{ML} = \frac {1}{n} \sum\limits_{i=1}^n \frac {| y_i \cap \hat{y}_i |}{| y_i \cup \hat{y}_i |}

Ranking metrics take into account the predicted ranking of the labels rather than just the hard label assignments.

One-Error measures how often the top-ranked predicted label is not in the set of true labels of the instance:

One-Error = \frac {1}{n} \sum\limits_{i=1}^n {I}\left(\arg\min_{\lambda \in \{1, \dots, q\}} r_i (\lambda) \notin y_i\right)

where r_i (\lambda) is the predicted rank of label \lambda for instance i (rank 1 being the most confident) and y_i is the instance’s set of true labels.

Coverage Error computes the average number of labels that have to be included in the final ranked prediction in order to cover all the true labels. In other words, if for a given example you need your first m ranked labels to cover all the actual labels, that example contributes m - 1 (the offset means a perfectly ranked single-label example costs 0). For n examples,

Coverage = \frac {1}{n} \sum\limits_{i=1}^n \left( \max_{\lambda \in y_i} r_i (\lambda) - 1 \right)

Precision, recall, and F1 (the harmonic mean of precision and recall) come in macro- and micro-averaged forms. In multi-label problems, the micro-measures tend to be more informative than the macro-measures because macro-measures weight all labels equally, while micro-measures are averaged over example/label pairs. The micro/macro difference for precision below extends to recall and F1:

macro-precision = \frac {1}{q} \sum\limits_{i=1}^q \frac {TruePositive_i} {TruePositive_i + FalsePositive_i}

micro-precision = \frac {\sum\limits_{i=1}^q TruePositive_i} {\sum\limits_{i=1}^q TruePositive_i + \sum\limits_{i=1}^q FalsePositive_i}
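
Most of these are available in scikit-learn and take binary indicator matrices directly; the ranking metrics additionally need the raw per-label scores. A small sketch with made-up predictions and scores (note that sklearn’s coverage_error does not subtract 1, unlike the Coverage formula above):

import numpy as np
from sklearn.metrics import (hamming_loss, jaccard_score, precision_score,
                             f1_score, coverage_error)

# Made-up ground truth, hard predictions, and per-label confidence scores
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.7, 0.1],
                    [0.1, 0.8, 0.3, 0.6],
                    [0.7, 0.9, 0.2, 0.4]])

print(hamming_loss(y_true, y_pred))                      # fraction of wrong labels
print(jaccard_score(y_true, y_pred, average='samples'))  # example-based accuracy above
print(precision_score(y_true, y_pred, average='macro'))  # averaged over labels
print(precision_score(y_true, y_pred, average='micro'))  # pooled over example/label pairs
print(f1_score(y_true, y_pred, average='micro'))
print(coverage_error(y_true, y_score))                   # ranking-based coverage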

Evaluation Implementation

Optimization needs a direction, and improving classifier performance is not self-explanatory for multi-label. Before building the classifier, clarifying the end goal is necessary. Are incorrect labels a problem? How important is it to capture all correct labels? Do we need to capture the top n ranked labels? On top of the sensitivity/specificity threshold tradeoff, ranking metrics introduce another dimension of considerations.

Jesse Read demonstrates how optimizing for one metric can decrease performance under another: optimizing for Hamming loss can hurt the exact match (0/1) score, and vice versa.

[Figures from Jesse Read’s slides: predictions optimized for Hamming loss vs. predictions optimized for exact match (0/1 loss) differ]

Often, the best strategy is to define the classifier’s objective clearly beforehand and to track several competing evaluation measures so you can follow classifier performance from a number of different angles.

State-of-the-Art Methods and Evaluation

[Table from Madjarov et al. 2012: the state-of-the-art methods included in their comparison]

Which algorithms have the best performance historically?

[Madjarov et al. 2012] provides a deeper overview (with references and empirical results) of the classifier space and its performance. The authors identified the methods above as state-of-the-art and compared their performance on a variety of benchmark multi-label datasets using ranking-based, label-based, and example-based evaluation metrics.

There are exceptions and special cases, but the broad conclusion is that the best performing methods are – in descending order –

  1. RF-PCT (random forest using predictive clustering trees): CLUS
  2. HOMER: Mulan
  3. Binary Relevance: Mulan
  4. Classifier Chains: Meka

Implementation

Most state-of-the-art algorithms can be found in Mulan and Meka.

It’s important to note that, for the algorithms and metrics discussed here, what you see in the literature might not be what you get from an implementation library, even when multi-label support is explicitly advertised. “Support for multi-label” often just means data transformation into BR or multi-output form. For example, sklearn’s KNN explicitly supports multi-label input, but a look at the source shows it has been adapted for multi-output prediction; it doesn’t exploit label dependence with the Bayesian inference proposed in ML-kNN.
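
For example, sklearn’s KNeighborsClassifier will happily accept an indicator matrix and predict each label from the neighbours independently; for ML-kNN proper you’ll need something like scikit-multilearn’s implementation (as far as I know it ships one). A quick sketch of the sklearn behaviour:

from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier

X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=5, random_state=0)

# Multi-output kNN: each label is predicted from the neighbours' votes
# independently, with no ML-kNN-style Bayesian correction
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)
print(knn.predict(X[:3]))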

However, if you plan to follow a data transformation approach, you should be able to translate the dependency information into a new data structure suitable for multiclass or multi-output algorithms.

If you’re not willing to dive straight into unfamiliar libraries, the following approach should nevertheless get you good performance on multi-label problems:

  1. Compute some basic statistics on your dataset: number of labels, number of features, label cardinality, and so on. Compare your dataset to the benchmark datasets in Table 1 of [Madjarov et al. 2012], then look at the B tables at the end of that paper to see which approach works best on the dataset most similar to yours.
  2. Start with binary relevance and see how it performs under a number of different metrics. Comparing it against a dependence-aware method, along with the label distribution metric and mutual information, should give some indication of how much label dependence matters.
  3. If performance is poor, you can manually group correlated labels together, or explore hierarchical or dimensionality reduction methods to group labels automatically.
  4. If performance is still unsatisfactory and the poor performance is not obviously a result of your dataset (too many labels, not enough data), try HOMER and RF-PCT.

References and Further Reading

There are a few great resources out there for multi-label learning. Jesse Read is a co-creator of MEKA and of classifier chains, and did his PhD on multi-label learning. His materials are a great place to start, and nearly all of the images in this guide are swiped from his talks and papers.

Beyond the papers linked throughout the guide in reference to specific topics, the following resources each provide a good overview of the field in its entirety:

PowerPoint Presentation on Multi-Label Classification [2013] by Jesse Read (Part 1 and Part 2)

Jesse Read’s PhD Thesis on Multi-Label Classification [2010]

A Literature Survey on Algorithms for Multi-label Learning [2010] by Sorower

A Tutorial on Multi-Label Learning [2015] by Gibaja and Ventura

Multilabel Learning: A Review of the State of the Art and Ongoing Research [2014] by Gibaja and Ventura
