Inductive Transfer for Text Classification using Generalized Reliability Indicators

Paul N. Bennett, Susan T. Dumais, Eric Horvitz

Abstract:

Machine-learning researchers face the persistent challenge of developing predictive models whose accuracy improves rapidly as scarce labeled training data is added. We introduce Layered Abstraction-Based Ensemble Learning (LABEL), a method that shows promise in improving generalization performance by exploiting additional labeled data drawn from related discrimination tasks within a corpus and from other corpora. LABEL first maps the original feature space, targeted at predicting membership in a specific topic, to a new feature space aimed at modeling the reliability of an ensemble of text classifiers. The resulting abstracted representation is invariant across the binary discrimination tasks, allowing the data to be pooled. We then construct a context-sensitive combination rule for each task using the pooled data. As a result, we are able to model domain structure more accurately than would be possible using only the limited labeled data from each task separately. Using several corpora for an empirical evaluation of topic classification accuracy on text documents, we demonstrate that LABEL can increase generalization performance across a set of related tasks.
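
The pooling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two abstract features (mean and spread of the base classifier scores), the logistic-regression combiner, the toy data, and every function name are assumptions chosen for illustration. For brevity the sketch also learns a single shared combination rule from the pooled data rather than one rule per task.

```python
import math

def abstract_features(base_scores):
    """Map base-classifier scores for one (document, task) pair to
    task-invariant features. These two features are illustrative
    assumptions, not the paper's reliability indicators."""
    mean = sum(base_scores) / len(base_scores)
    spread = max(base_scores) - min(base_scores)  # disagreement among classifiers
    return [1.0, mean, spread]  # leading 1.0 acts as a bias term

def pool_tasks(tasks):
    """Pool examples from all tasks into one training set. This is only
    valid because the abstract features mean the same thing in every task."""
    X, y = [], []
    for examples in tasks.values():
        for base_scores, label in examples:
            X.append(abstract_features(base_scores))
            y.append(label)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_combiner(X, y, epochs=500, lr=0.1):
    """Fit a logistic-regression combiner with plain stochastic gradient
    ascent on the log-likelihood (a stand-in for the combination rule)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (t - p) * xi for wi, xi in zip(w, x)]
    return w

def combine(w, base_scores):
    """Estimated probability of topic membership from the learned combiner."""
    x = abstract_features(base_scores)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical toy data: each task maps to (base classifier scores, label) pairs.
tasks = {
    "topic_a": [([0.9, 0.8], 1), ([0.2, 0.1], 0)],
    "topic_b": [([0.7, 0.9], 1), ([0.3, 0.2], 0)],
}
X, y = pool_tasks(tasks)
weights = train_combiner(X, y)
p_high = combine(weights, [0.9, 0.9])  # base classifiers agree: in topic
p_low = combine(weights, [0.1, 0.1])   # base classifiers agree: not in topic
```

Because the combiner sees only the abstract features, it can in principle be applied to a task that contributed few or no labeled examples, which is the sense in which pooled data from related tasks helps each individual task.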

Keywords: Inductive transfer, ensemble methods, classifier re-use, generalization for learning, classifier combination, metaclassifiers, reliability indicators.

In: Proceedings of ICML Workshop on The Continuum from Labeled to Unlabeled Data, Washington, DC, August 2003.

Author Email: pbennett+www@cs.cmu.edu, sdumais@microsoft.com, horvitz@microsoft.com