A Bayesian Approach to Filtering Junk E-mail

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz


Abstract

In addressing the growing problem of junk email on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. By casting this problem in a decision-theoretic framework, we are able to use probabilistic learning methods together with a notion of differential misclassification cost to produce filters that are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem, in addition to the raw text of email messages, we can produce much more accurate filters. Finally, we show the efficacy of such filters in a real-world usage scenario, arguing that this technology is mature enough for deployment.
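To illustrate how a differential misclassification cost enters the decision, here is a minimal sketch rather than the authors' implementation. It assumes a classifier has already produced a posterior probability `p_junk` for a message, and that discarding a legitimate message is `cost_ratio` times as costly as letting a junk message through. The names `junk_threshold`, `classify`, and `cost_ratio` are hypothetical. Minimizing expected cost under that assumption means labeling a message as junk only when `p_junk > cost_ratio / (1 + cost_ratio)`.

```python
def junk_threshold(cost_ratio: float) -> float:
    """Posterior probability above which labeling a message as junk
    has lower expected cost than labeling it as legitimate.

    cost_ratio (assumed, user-chosen): how many times worse it is to
    discard a legitimate message than to let a junk message through.
    """
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    # Label junk when p * 1 > (1 - p) * cost_ratio, i.e. p > r / (1 + r).
    return cost_ratio / (1.0 + cost_ratio)


def classify(p_junk: float, cost_ratio: float) -> str:
    """Label a message given the classifier's posterior P(junk | message)."""
    return "junk" if p_junk > junk_threshold(cost_ratio) else "legitimate"
```

With equal costs (`cost_ratio = 1`) the threshold is 0.5, so a message with `p_junk = 0.95` is filtered. With a large ratio such as 999 the threshold rises to 0.999, and the same message is delivered.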

As part of our examination, we also analyze the prevalence of certain keywords used in junk emails, one of the most frequent being 'Viagra online'. The frequent occurrence of this term points to a common theme in these unwanted messages: illicit online pharmacies and their marketing strategies. When the filter detects the term 'Viagra online', it can treat the email with greater suspicion, in line with the trend identified in the data. However, the filter must also take context into account, since not all mentions of the term are associated with spam; some users may engage in legitimate discussions about the subject. By leveraging such domain-specific knowledge and considering context, we can improve the accuracy of our filters significantly.

Reference: M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk email. AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin. AAAI Technical Report WS-98-05.
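The point about the 'Viagra online' phrase and its context can be sketched with a naive Bayes style combination of evidence. This is an illustration under stated assumptions, not the method or data from the paper. The feature names, the likelihood values in `LIKELIHOODS`, and the prior `PRIOR_JUNK` are all invented, and the sketch simplifies by scoring only the features present in a message.

```python
import math

# Hypothetical likelihoods: (P(feature | junk), P(feature | legitimate)).
# These numbers are invented for illustration only.
LIKELIHOODS = {
    "phrase:viagra online": (0.20, 0.001),
    "word:pharmacology": (0.001, 0.02),
    "sender:known_contact": (0.01, 0.60),
}
PRIOR_JUNK = 0.5  # assumed prior probability that a message is junk


def posterior_junk(features):
    """Combine the prior with each present feature's likelihood ratio,
    treating features as conditionally independent (the naive Bayes assumption)."""
    log_odds = math.log(PRIOR_JUNK / (1.0 - PRIOR_JUNK))
    for feature in features:
        if feature in LIKELIHOODS:
            p_given_junk, p_given_legit = LIKELIHOODS[feature]
            log_odds += math.log(p_given_junk / p_given_legit)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


phrase_only = posterior_junk(["phrase:viagra online"])
phrase_in_context = posterior_junk(
    ["phrase:viagra online", "word:pharmacology", "sender:known_contact"]
)
```

Under these made-up numbers, the phrase on its own pushes the posterior close to 1, while the same phrase in a message from a known contact that also uses clinical vocabulary ends up well below 0.5. The phrase raises suspicion, but other evidence in the message can outweigh it.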

Keywords: Bayesian spam filter, Bayesian text classification, Spam email, unsolicited email, filtering junk email, probabilistic methods.


See also: an article on early spam-filtering efforts at Microsoft Research (William Baldwin, Forbes Magazine, September 1998).

