Johan Vromans writes:
: Larry Wall <larry@wall.org> writes:
: > If more than 50% of a subject's characters are high-bit characters,
: > it goes straight into my spam mailbox without trying any of the
: > other heuristics.
: I use 'more than 5 high-bit characters in a row'.

That might work better than 50%, thanks.
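
For comparison, the two subject tests mentioned above might be sketched
like this. The sub names and sample subjects are invented for
illustration; neither filter's actual code appears in this thread, and
the sketch assumes the subject is a raw byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: true if more than half of the subject's bytes
# have the high bit set (the "more than 50%" test).
sub mostly_high_bit {
    my ($subject) = @_;
    my $len = length $subject;
    return 0 unless $len;
    my $high = ($subject =~ tr/\x80-\xFF//);   # tr with no replacement just counts
    return $high > $len / 2 ? 1 : 0;
}

# Hypothetical helper: true if the subject has more than five
# high-bit bytes in a row (that is, a run of six or more).
sub high_bit_run {
    my ($subject) = @_;
    return $subject =~ /[\x80-\xFF]{6,}/ ? 1 : 0;
}

# A short high-bit run followed by a long ASCII tail: the run test
# fires, the 50% test does not.
my $subject = "\xB1\xB2\xB3\xB4\xB5\xB6 cheap pills for you";
print "50% test: ", mostly_high_bit($subject), "\n";
print "run test: ", high_bit_run($subject), "\n";
```

The run test does not care how much plain ASCII surrounds the high-bit
stretch, which may be why it catches subjects the percentage test lets
through.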

: If so, the message
: is immediately dumped in the bit bucket. It doesn't even get a chance
: to end up in the spam box. Using this criterion I hardly ever see
: Chinese (et al.) mail anymore.

Well, I oversimplified what I do. I actually peruse the subjects in my
spam mailbox daily because occasionally my friends send me things that
get classified as spam. All it takes is for the message to contain too
many words in all caps, like QUICK, and GUARANTEE, and DBI.

Or just too many mentions of unpleasant subjects like sex and money. :-)
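
The all-caps test isn't spelled out here. A minimal sketch, assuming
that a "word in all caps" is a run of three or more capital letters and
that the limit is a tunable number (both are assumptions, as are the
sub names):

```perl
use strict;
use warnings;

# Hypothetical helper: count words written entirely in capitals.
# The three-letter minimum is an assumption, so "I" and "A" don't count.
sub count_all_caps {
    my ($text) = @_;
    my @caps = $text =~ /\b[A-Z]{3,}\b/g;   # list context returns every match
    return scalar @caps;
}

# Hypothetical threshold test; $limit is a made-up tunable.
sub too_many_caps {
    my ($text, $limit) = @_;
    return count_all_caps($text) > $limit ? 1 : 0;
}

my $msg = "Need a QUICK fix? I GUARANTEE that DBI can do it.";
print count_all_caps($msg), "\n";   # prints 3
```

A perfectly friendly message about DBI trips this just as easily as a
sales pitch does, which is why the spam mailbox still gets read.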

But I'll also throw a message all the way into the bit bucket if it
looks like gobbledygook (ignoring headers, MIME, HTML, and such).

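    if (not $suppress) {                        # Find Chinese spam.
        my $tmp = $body;
        $tmp =~ s/=([0-9A-F]{2})/chr hex $1/eg; # decode quoted-printable escapes
        $tmp =~ s/^Content.*\n//mg;             # drop MIME Content-* lines (/m so ^ matches every line)
        $tmp =~ s/^This is a MIME.*\n*//mg;     # drop the MIME preamble
        $tmp =~ s/----.*\n*//g;                 # drop boundary and separator lines
        $tmp =~ s/====.*\n*//g;
        $tmp =~ s/\*\*\*\*.*\n*//g;
        $tmp =~ s/\s+/ /g;                      # collapse whitespace
        $tmp =~ s/<script>.*?<\/script>\s*//sg; # strip scripts, then other HTML tags
        $tmp =~ s/<[^>]*>\s*//g;
        $tmp =~ s/&\w+;\s*//g;                  # strip HTML entities
        $tmp =~ s{http://[-.\w/]*\s*}{}g;       # strip every URL, not just the first
        my $engbytes = $tmp =~ tr/ ,.;:'"()a-zA-Z0-9//;  # count "English-looking" bytes
        my $allbytes = length($tmp);
        # Suppress if under half of what's left looks like English;
        # $suppress records the percentage as the reason.
        $suppress = int(100 * $engbytes / $allbytes) . "%"
            if $engbytes < $allbytes / 2;
    }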

: And I have yet to find a 'serious' message in my bit bucket ;-)

Maybe you haven't found one, but I have. I'm paranoid enough that I
can even read my bit bucket (though I seldom do).

Actually, that's not quite true--what I can always read is the original
mailbox. I normally read my mail with trn, and the code above just
prevents the transfer of the message from my normal mailbox to the news
system. But I keep a copy of everything that comes in. So if you want
something backed up forever, just mail it to me. :-)


