If you're like most people, your email inbox is a catch basin for the daily torrent of offers for sex performance-enhancing drugs, weight-loss programs and Third World moneymaking schemes.
Me? I get very few, thanks to a fairly straightforward application of statistical analysis to text processing -- Bayesian filtering.
Thomas Bayes didn't have spam to contend with when he wrote his "Essay Towards Solving a Problem in the Doctrine of Chances," published posthumously in 1763, which introduced the mathematical basis for statistical inference. In layman's terms, it's the calculation of how likely something is to occur in the future, based on how often it has occurred in the past. When applied to email, a Bayesian filter will calculate the likelihood that I will want to participate in a money-laundering scheme with the widow of a former African president or deposed minister based on the number of times I have deleted their previous messages.
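To make that calculation concrete, here is a minimal sketch of Bayes' rule applied to a single word. The function name, the word counts and the 50/50 prior are all assumptions chosen for illustration, not figures from any real mailbox or filter.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Return P(spam | word) using Bayes' rule.

    p_word_given_spam: fraction of past spam containing the word
    p_word_given_ham:  fraction of past wanted mail containing the word
    p_spam:            assumed prior probability that any message is spam
    """
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    evidence = numerator + p_word_given_ham * p_ham
    return numerator / evidence


# Hypothetical history: "widow" appeared in 40 of 200 deleted messages
# and in 1 of 200 messages that were kept.
p = posterior_spam(40 / 200, 1 / 200)
print(round(p, 3))  # 0.976
```

With those made-up counts, a message containing "widow" is roughly 98% likely to be one I would delete, before any other word is even considered.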
Bayesian filtering works very well: it has been measured at greater than 99.9% accuracy at detecting spam and other unwanted email.
In his paper, "A Plan for Spam," Paul Graham popularized Bayesian filtering on word groupings to prevent spammers from getting their messages across. After all, how can you sell a new genital enlargement cream without using the words "enlargement" or "bigger" in conjunction with certain other words? Eventually, spammers are reduced to incoherence through vocabulary starvation.
What if I do want discount prescription offers? Bayesian filters will "learn" that, too. Statistical filtering can be very specific to a given user's requirements.
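The word-counting idea in the two paragraphs above can be sketched in a few lines. This is a simplified naive Bayes filter with add-one smoothing, not Graham's exact formula and not the code of any real product; the class name `TinyBayesFilter`, the tokenizer and the sample messages are all hypothetical.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy per-user filter: learns from messages the user labels."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase token counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, label):
        self.messages[label] += 1
        self.counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        # Work in log-odds so many small factors do not underflow.
        log_odds = math.log(
            (self.messages["spam"] + 1) / (self.messages["ham"] + 1)
        )
        for tok in self.tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing the result.
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before exponentiating to avoid overflow on long messages.
        log_odds = max(-30.0, min(30.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("cheap enlargement cream, bigger results guaranteed", "spam")
f.train("urgent: widow of former minister needs your help moving funds", "spam")
f.train("lunch on thursday? the firewall review moved to friday", "ham")
f.train("patch notes for the log parser are attached", "ham")

print(f.spam_probability("bigger enlargement offer") > 0.5)       # True
print(f.spam_probability("firewall log review on friday") < 0.5)  # True
```

Because the counts come only from what one user has labeled, the same code adapts to taste: train it with prescription offers marked "ham" and those words start pulling scores toward wanted mail.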
SpamAssassin is an inline mail filtering system that can plug into an individual's mail-delivery stream or be incorporated into a mail relay using the "milter" interface for sendmail, Postfix, qmail or other MTAs. Its approach combines signatures, whitelists, shared checksums, message analysis (looking for signs of malformation in message headers that indicate a spoofed delivery) and Bayesian filtering. It throws the kitchen sink at messages, and anything that survives probably isn't spam. SpamAssassin is best installed in a forwarding mail gateway at the boundary of an enterprise.
Eric Raymond's Bogofilter is a general message filter that maintains a Bayesian index of bogus elements for each message. By inserting such a measurement into message processing, a mail client can be set to automatically trash spam messages. Bogofilter hooks into a personal mail filtering system like procmail. It's better suited as an individual antispam system than something like SpamAssassin.
Bill Yerazunis' CRM114 -- Controllable Regexp Mutilator -- can also be used individually as a procmail filter or generalized message filter, but it is more of a message filtering and categorization system for an entire environment. It has a somewhat bizarre language in which users can write specific filters to chop up and analyze components of a data stream for "distinctness." Beyond spam detection, it has a lot of potential uses. As Yerazunis points out, it might be a good tool for syslog analysis or firewall event log processing. To my knowledge, no one is using it for those purposes, but it's definitely a cool tool.
There are a lot of interesting areas in which Bayesian filters can expand. For example, they could treat embedded HTML directives as text, rather than attempting to discriminate URLs or other markup elements. The next generation of Bayesian filters may look at the frequencies with which words occur together, or even the frequencies with which chains of words occur, to flag unwanted messages.
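One way to picture the "words together" and "chains of words" idea is to emit word chains as extra tokens alongside the single words. The sketch below is an assumption about how such features might be generated; the function name `tokens_with_chains` and its `max_chain` parameter are hypothetical and not taken from any existing filter.

```python
import re


def tokens_with_chains(text, max_chain=2):
    """Return single words plus chains of up to max_chain adjacent words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    features = []
    for n in range(1, max_chain + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


print(tokens_with_chains("Bigger results guaranteed"))
# ['bigger', 'results', 'guaranteed', 'bigger results', 'results guaranteed']
```

Features like these could be counted by any token-based filter exactly as single words are, so a phrase such as "results guaranteed" can carry its own weight even when neither word alone is suspicious.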
My prediction: The guys coding Bayesian filters will continue to refine the technique and dramatically reduce spammers' ability to get a coherent message across -- perhaps to the point where they turn to some other medium. We can only hope.
About the author:
Marcus J. Ranum is a senior scientist with TruSecure Corp. He is the founder of NFR Security and built the first commercial firewall product, DEC SEAL. He is the author of The Myth of Homeland Security (Wiley, 2003).